Abstract
Recent large language models (LLMs) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high-quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross-lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on adapting an English LLM to Hungarian and Thai show that our recipe can reach better performance than open-source models on the target language, with minimal regressions on English.
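
The two ingredients named above, vocabulary extension and data mixing, can be illustrated with a toy sketch. It is not the implementation used in this work: the longest-match tokenizer stands in for a real subword tokenizer, the function names (`fertility`, `greedy_tokenizer`, `mix_batches`) are hypothetical, and the vocabularies and the 0.75 mixing ratio are made-up illustrative values. Encoding efficiency is measured here as fertility, assumed to mean the average number of tokens per whitespace-separated word.

```python
import random


def fertility(tokenize, texts):
    """Average tokens per whitespace-separated word (lower means more efficient)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


def greedy_tokenizer(vocab):
    """Toy longest-match tokenizer over a non-empty `vocab`.

    Falls back to single characters when no vocabulary entry matches.
    This is a stand-in for a real subword tokenizer, not a BPE implementation.
    """
    max_len = max(len(v) for v in vocab)

    def tokenize(text):
        tokens = []
        for word in text.split():
            i = 0
            while i < len(word):
                # Try the longest candidate first, shrinking down to one character.
                for j in range(min(len(word), i + max_len), i, -1):
                    if word[i:j] in vocab or j == i + 1:
                        tokens.append(word[i:j])
                        i = j
                        break
        return tokens

    return tokenize


def mix_batches(target_docs, english_docs, target_ratio, n, seed=0):
    """Sample n documents, each drawn from the target language with
    probability `target_ratio` and from English otherwise."""
    rng = random.Random(seed)
    return [
        rng.choice(target_docs) if rng.random() < target_ratio else rng.choice(english_docs)
        for _ in range(n)
    ]


# Hypothetical vocabularies: an English-centric base, then the same base
# extended with whole-word tokens from the target language (Hungarian here).
base_vocab = {"the", "a"}
extended_vocab = base_vocab | {"köszönöm", "szépen"}
hungarian_texts = ["köszönöm szépen"]

print("fertility (base):    ", fertility(greedy_tokenizer(base_vocab), hungarian_texts))
print("fertility (extended):", fertility(greedy_tokenizer(extended_vocab), hungarian_texts))

# Keeping some English in the mix is the lever for mitigating forgetting.
print(mix_batches(hungarian_texts, ["the cat sat"], target_ratio=0.75, n=8))
```

In this toy setting, extending the vocabulary lowers fertility on the target-language text, and `target_ratio` controls how much English is retained in the training stream.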