Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation

Abstract

Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early in CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts in the target language as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a big shift in the model parameters. Based on these insights, we introduce curriculum learning and exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light on the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation to design more effective methods in the future.
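
For readers unfamiliar with EMA of weights, the following is a minimal sketch of the generic technique, not the exact formulation used in this work. The function name `ema_update`, the flat list-of-floats weight representation, and the `decay` values are illustrative assumptions; the abstract does not specify any of them.

```python
def ema_update(ema_weights, current_weights, decay=0.99):
    """Blend the running average with the latest weights.

    Generic EMA rule (an assumption, not taken from the paper):
        ema <- decay * ema + (1 - decay) * current
    A higher decay keeps the average closer to earlier checkpoints,
    which limits how far the averaged parameters can shift per step.
    """
    return [
        decay * e + (1.0 - decay) * w
        for e, w in zip(ema_weights, current_weights)
    ]


# Toy example: two scalar "parameters" tracked over two training steps.
ema = [0.0, 1.0]  # initialised from the starting checkpoint
for step_weights in ([1.0, 1.0], [2.0, 1.0]):  # hypothetical weights after each step
    ema = ema_update(ema, step_weights, decay=0.5)
```

In this toy run the first parameter moves from 0.0 to 2.0 in the raw weights, while its averaged value lags behind, which is the smoothing effect that makes EMA a plausible way to dampen a large parameter shift.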
