Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

Abstract

While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context capabilities of LLMs in multilingual settings is crucial, specifically in the context of information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model's ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). This test serves as an extension of the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance can vary significantly with language and needle position. Specifically, we observe that model performance is the lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of $8k$ tokens or greater, none demonstrate satisfactory cross-lingual retrieval performance as the context length increases. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
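
As a rough illustration of the setup described above, the following minimal sketch places a needle passage at the start, middle, or end of a list of multilingual distractor passages. It is an assumption-laden toy, not the authors' implementation: the function names `build_context` and `needle_positions`, the passage separator, and the sample sentences are all hypothetical.

```python
def build_context(needle, distractors, position):
    """Insert the needle passage at index `position` among the distractors.

    Hypothetical helper: the passage separator and layout are assumptions.
    """
    if not 0 <= position <= len(distractors):
        raise ValueError("position out of range")
    passages = distractors[:position] + [needle] + distractors[position:]
    return "\n\n".join(passages)


def needle_positions(num_distractors):
    """Return insertion indices for the start, middle, and end of the haystack."""
    return {
        "start": 0,
        "middle": num_distractors // 2,
        "end": num_distractors,
    }


# Toy data: a German needle among distractors in other languages.
needle = "Der Eiffelturm steht in Paris."
distractors = [
    "El Nilo es un río de África.",
    "富士山は日本にあります。",
    "Hanoi là thủ đô của Việt Nam.",
    "Rome is the capital of Italy.",
]

# One context per needle position; each would be paired with a question
# and passed to the model under evaluation.
contexts = {
    name: build_context(needle, distractors, index)
    for name, index in needle_positions(len(distractors)).items()
}
```

Varying the needle's language and the index passed to `build_context` gives the two axes the abstract reports on: language and needle position.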
