Abstract
Language models have made remarkable advancements in understanding and generating human language, achieving notable success across a wide array of applications. However, evaluating these models remains a significant challenge, particularly for resource-limited languages such as Turkish. To address this gap, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is constructed from a carefully curated dataset comprising 6,200 multiple-choice questions across 62 sections, selected from a pool of 280,000 questions spanning 67 disciplines and over 800 topics within the Turkish education system. This benchmark provides a transparent, reproducible, and culturally relevant tool for evaluating model performance. It serves as a standard framework for Turkish NLP research, enabling detailed analyses of LLMs' capabilities in processing Turkish text and fostering the development of more robust and accurate language models. In this study, we evaluate state-of-the-art LLMs on TR-MMLU, providing insights into their strengths and limitations for Turkish-specific tasks. Our findings reveal critical challenges, such as the impact of tokenization and fine-tuning strategies, and highlight areas for improvement in model design. By setting a new standard for evaluating Turkish language models, TR-MMLU aims to inspire future innovations and support the advancement of Turkish NLP research.