Training Bilingual LMs with Data Constraints in the Targeted Language

Abstract

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language with insufficient pretraining data for training a high-performing language model, by enlisting data from an auxiliary language for which high-quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language compared with training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.
