Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning

Abstract

Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As model sizes grow, the performance gap between English and other languages with fewer compute and data resources increases even further. Consequently, more resource-efficient training methods are needed to bridge the gap for languages with fewer resources available. To address this problem, we introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer, that transfers models from a source language for which pretrained models are publicly available, like English, to a new target language. As opposed to prior work, which focused on the cross-lingual transfer between two languages, we extend the transfer to the model size. Given a pretrained model in a source language, we aim for a same-sized model in a target language. Instead of training a model from scratch, we exploit a smaller model that is in the target language but requires far fewer resources. Both the small and the source models are then used to initialize the token embeddings of the larger model based on the overlapping vocabulary of the source and target languages. All remaining weights are reused from the model in the source language. This approach outperforms cross-lingual transfer alone and can save up to 80% of the training steps compared to random initialization.
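
The embedding initialization described above can be illustrated with a minimal sketch. The abstract only states that the small target-language model and the large source-language model are combined through their overlapping vocabulary; the details below are assumptions made for illustration: tokens in the overlap copy the large source embedding, and every other target token receives a weighted average of the overlapping tokens' source embeddings, weighted by cosine similarity in the small model's embedding space (negative similarities are clipped to zero). The function name `init_target_embeddings` and the toy vocabularies are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def init_target_embeddings(source_large, target_small):
    """Build large-dimension embeddings for every token of the target vocabulary.

    source_large: dict token -> vector from the large source-language model
    target_small: dict token -> vector from the small target-language model
    """
    overlap = [tok for tok in target_small if tok in source_large]
    if not overlap:
        raise ValueError("source and target vocabularies do not overlap")
    dim = len(source_large[overlap[0]])

    result = {}
    for tok, small_vec in target_small.items():
        if tok in source_large:
            # Overlapping token: reuse the large source embedding directly.
            result[tok] = list(source_large[tok])
            continue
        # Assumption: weight overlapping tokens by their similarity to `tok`
        # in the small target model, clipping negative similarities to zero.
        weights = [max(cosine(small_vec, target_small[o]), 0.0) for o in overlap]
        total = sum(weights)
        if total == 0.0:
            # Fall back to a plain average when nothing is similar.
            weights = [1.0] * len(overlap)
            total = float(len(overlap))
        vec = [0.0] * dim
        for w, o in zip(weights, overlap):
            for i, x in enumerate(source_large[o]):
                vec[i] += (w / total) * x
        result[tok] = vec
    return result


# Toy vocabularies: "a" and "b" overlap, "c" exists only in the target
# language, and "x" exists only in the source language.
source_large = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "x": [9.0, 9.0, 9.0]}
target_small = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
emb = init_target_embeddings(source_large, target_small)
```

In this toy setup, "c" points in the same direction as "a" in the small model, so it inherits the large embedding of "a". In the approach described in the abstract, all non-embedding weights would then be copied from the large source model before training continues on target-language data.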
