ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval

Abstract

State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, effectively adapts to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages. We publicly release our code and models for the community.
