LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

Abstract

Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are integrated through a minimal set of trainable parameters that act as a connector, transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances multilingual performance across various embedding tasks, particularly for medium- and low-resource languages, without requiring explicit multilingual training data.
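
The data flow described above (multilingual encoder, then connector, then LLM-based embedding model) can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the connector is modeled as a single linear projection, the names `make_connector`, `connect`, and `mean_pool` are hypothetical, the encoder output is hard-coded toy data, and mean pooling merely stands in for the LLM-based embedding model.

```python
import random


def make_connector(enc_dim, llm_dim, seed=0):
    """Create a toy connector: one linear layer (weights + bias).

    Assumption: the abstract only says the connector is a minimal set of
    trainable parameters; a single linear projection is an illustrative guess.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(enc_dim)]
               for _ in range(llm_dim)]
    bias = [0.0] * llm_dim
    return weights, bias


def connect(token_states, connector):
    """Project each encoder token state into the LLM's input dimension."""
    weights, bias = connector
    return [
        [sum(w * x for w, x in zip(row, state)) + b
         for row, b in zip(weights, bias)]
        for state in token_states
    ]


def mean_pool(token_states):
    """Average token vectors into one vector (stand-in for the embedding model)."""
    n = len(token_states)
    return [sum(column) / n for column in zip(*token_states)]


# Toy stand-in for multilingual encoder output: 3 tokens, 4 dimensions each.
encoder_states = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, -0.1, 0.2, 0.5],
    [0.3, 0.3, -0.2, 0.1],
]

connector = make_connector(enc_dim=4, llm_dim=6)
llm_inputs = connect(encoder_states, connector)   # shape: 3 tokens x 6 dims
embedding = mean_pool(llm_inputs)                 # one 6-dimensional vector

print(len(llm_inputs), len(llm_inputs[0]), len(embedding))  # prints: 3 6 6
```

In this sketch only the connector's weights and bias would be candidates for training, which mirrors the abstract's point that the two larger components are bridged by a small number of parameters.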
