Larger-Scale Transformers for Multilingual Masked Language Modeling

Abstract

Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models, dubbed XLM-R XL and XLM-R XXL, outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests that pretrained models with larger capacity may obtain strong performance on high-resource languages while greatly improving low-resource languages. We make our code and models publicly available.
