Abstract
Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts such as those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU (Multi-task Indic Language Understanding Benchmark), a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 42 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 45 LLMs and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 72 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high-resource languages than in low-resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas such as Arts and Humanities and Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first benchmark of its kind focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts are publicly available to foster open research.