Training Bilingual LMs with Data Constraints in the Targeted Language

Abstract

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a data-constrained target language by enlisting data from an auxiliary language for which high-quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language and training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling for data-constrained languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.
