BnMMLU: Measuring Massive Multitask Language Understanding in Bengali

Abstract

The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the multitask language understanding capabilities of language models in Bengali. The dataset spans 23 domains, including science, humanities, mathematics, and general knowledge, and is structured in a multiple-choice format to assess the factual knowledge, application-based problem-solving, and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application, and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.
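
As a rough illustration of how multiple-choice results might be broken down by cognitive category, the following minimal sketch computes per-category accuracy. It is not the authors' evaluation code: the field names (`answer`, `category`), the category labels, and the toy items are hypothetical and do not come from the released dataset.

```python
from collections import defaultdict


def accuracy_by_category(items, predictions):
    """Return accuracy per cognitive category.

    `items` is a list of dicts with hypothetical keys "answer" (the gold
    option label) and "category" (the cognitive category annotation).
    `predictions` is a list of predicted option labels in the same order.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item["category"]] += 1
        if predicted == item["answer"]:
            correct[item["category"]] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy items, invented for illustration only (not taken from BnMMLU).
items = [
    {"answer": "A", "category": "factual"},
    {"answer": "C", "category": "factual"},
    {"answer": "B", "category": "procedural"},
    {"answer": "D", "category": "reasoning"},
]
predictions = ["A", "B", "B", "A"]

scores = accuracy_by_category(items, predictions)
for category, score in scores.items():
    print(f"{category}: {score:.2f}")
```

Reporting accuracy separately for each category, rather than a single pooled score, is what allows strengths and weaknesses across cognitive tasks to be compared.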
