Abstract
English-centric large language models (LLMs) often show strong multilingual capabilities. However, their multilingual performance remains unclear and is under-evaluated for many other languages. Most benchmarks for multilinguality focus on classic NLP tasks or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA builds on the observation that English-centric LLMs use English as a pivot language in their intermediate layers. MEXA computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in different languages. We conduct controlled experiments using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves an average Pearson correlation of 0.90 between its predicted scores and actual task performance across languages. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://cis-lmu-mexa.hf.space, Code: https://github.com/cisnlp/MEXA.
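
To make the idea of alignment over parallel sentences concrete, the following is a minimal sketch and not the released MEXA implementation. It assumes sentence embeddings have already been extracted from one intermediate layer of the model, with row i of each array holding the same sentence in English and in the target language. The scoring rule used here (a parallel pair counts as aligned when its cosine similarity exceeds that of every non-parallel pair in its row and column) is one plausible choice, and the function names `alignment_score` and `pearson` are hypothetical.

```python
import numpy as np


def alignment_score(en_emb: np.ndarray, xx_emb: np.ndarray) -> float:
    """Fraction of parallel pairs that are mutual nearest neighbours.

    en_emb, xx_emb: arrays of shape (n_sentences, dim); row i of each array
    is assumed to embed the same sentence in English and the other language.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    xx = xx_emb / np.linalg.norm(xx_emb, axis=1, keepdims=True)
    sim = en @ xx.T  # sim[i, j] = cos(en_i, xx_j)

    diag = np.diag(sim).copy()      # similarities of the true parallel pairs
    off = sim.copy()
    np.fill_diagonal(off, -np.inf)  # mask the parallel pairs themselves

    # Assumed criterion: the parallel pair must beat every non-parallel
    # pair in both its row and its column.
    hits = (diag > off.max(axis=1)) & (diag > off.max(axis=0))
    return float(hits.mean())


def pearson(scores, task_accuracy) -> float:
    """Pearson correlation between per-language scores and task results."""
    return float(np.corrcoef(scores, task_accuracy)[0, 1])
```

In use, one would compute `alignment_score` once per language (and possibly per layer), then pass the per-language scores together with the measured downstream accuracies to `pearson`. How the embeddings are pooled from token representations is left open here, since the abstract only states that several methods are explored.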