Unsupervised Cross-lingual Representation Learning at Scale

Abstract

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high- and low-resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.
