BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Abstract

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.
