CINO: A Chinese Minority Pre-trained Language Model

Abstract

Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates the application of natural language processing to low-resource languages. However, there are still some languages on which current multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of the multilingual model on ethnic minority languages, we collect documents from Wikipedia and news websites and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.
