Abstract
In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages and designed for enhanced multilingual performance, with a focus on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.