Abstract
Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while keeping them effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits beyond performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.