Abstract
Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems. Its parallelism between tasks and languages can provide a faithful and fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that open-source LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.