MC^2: A Multilingual Corpus of Minority Languages in China

Abstract

Large-scale corpora play a vital role in the construction of large language models (LLMs). However, existing LLMs exhibit limited abilities in understanding low-resource languages, including the minority languages in China, due to a lack of training data. To improve the accessibility of these languages, we present MC^2, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus for these languages so far. It encompasses four underrepresented languages, i.e., Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script. Notably, two writing systems in MC^2 have long been neglected in previous corpora. As we identify serious contamination in the low-resource language splits of existing multilingual corpora, we propose a quality-centric solution for collecting MC^2, prioritizing quality and accuracy while enhancing representativeness and diversity. Through in-depth analysis, we demonstrate the new research challenges MC^2 brings, such as long-text modeling and multiplicity of writing systems. We hope MC^2 can help enhance the equity of the underrepresented languages in China and provide a reliable data foundation for further research on low-resource languages.
