CINO: A Chinese Minority Pre-trained Language Model

  • 2022-09-21 02:43:35
  • Ziqing Yang, Zihang Xu, Yiming Cui, Baoxin Wang, Min Lin, Dayong Wu, Zhigang Chen

Abstract

Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates the application of natural language processing to low-resource languages. However, there are still some languages on which current multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of multilingual models on ethnic minority languages, we collect documents from Wikipedia and news websites and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.

 
