Abstract
Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance.