MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Abstract

Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
