TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages

Abstract

Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high-quality native language is often costly and therefore limits the representativeness of evaluation datasets. While recent efforts have focused on building more inclusive MMLU benchmarks, these are conventionally built using machine translation from high-resource languages, which may introduce errors and fail to account for the linguistic and cultural intricacies of the target languages. In this paper, we address the lack of native-language MMLU benchmarks, especially for the under-represented Turkic language family, which has distinct morphosyntactic and cultural characteristics. We propose two benchmarks for Turkic language MMLU. TUMLU is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. Using this dataset, we systematically evaluate a diverse range of open and proprietary multilingual large language models (LLMs), including Claude, Gemini, GPT, and LLaMA, offering an in-depth analysis of their performance across different languages, subjects, and alphabets. To promote further research and development in multilingual language understanding, we release TUMLU-mini and all corresponding evaluation scripts.
