Abstract
While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context capabilities of LLMs in multilingual settings is crucial, specifically in the context of information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model's ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). This test serves as an extension of the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance can vary significantly with language and needle position. Specifically, we observe that model performance is the lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of $8k$ tokens or greater, none demonstrate satisfactory cross-lingual retrieval performance as the context length increases. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
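As a rough illustration of the needle-in-a-haystack setup described above, the following minimal sketch places a needle passage among multilingual distractors at a chosen position and assembles a question-answering prompt. It is not the MLNeedle implementation: the function names `build_haystack` and `build_prompt`, the `Document [i]` prompt format, and the toy passages are all assumptions made for illustration.

```python
from typing import List


def build_haystack(needle: str, distractors: List[str], position: int) -> List[str]:
    """Return the passages with the needle inserted at index `position`."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must lie between 0 and len(distractors)")
    return distractors[:position] + [needle] + distractors[position:]


def build_prompt(question: str, passages: List[str]) -> str:
    """Number the passages and append the question (hypothetical prompt format)."""
    numbered = [f"Document [{i + 1}]: {text}" for i, text in enumerate(passages)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


# Toy data, invented for illustration only (not taken from the benchmark).
needle = "Der Eiffelturm wurde 1889 fertiggestellt."
distractors = [
    "El río Amazonas atraviesa Brasil.",
    "Mount Fuji is the tallest mountain in Japan.",
    "La Loire est le plus long fleuve de France.",
    "The Nile flows north into the Mediterranean Sea.",
]
question = "When was the Eiffel Tower completed?"

# Place the needle at the start, middle, and end of the context.
prompts = {
    name: build_prompt(question, build_haystack(needle, distractors, pos))
    for name, pos in [
        ("start", 0),
        ("middle", len(distractors) // 2),
        ("end", len(distractors)),
    ]
}
```

In this sketch the question is in English while the needle is in German, which corresponds to the cross-lingual case; writing the needle in the question's language would give the monolingual case. Varying `position` and the number of distractors is one simple way to probe the position and context-length effects reported above.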