MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

Abstract

We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs. English-translated data can result in differences of more than 20 points for the best-performing models and can drastically change the estimated difficulty of some languages. When using machine instead of human translations, we find a weaker effect on the ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.
