ChocoLlama: Lessons Learned From Teaching Llamas Dutch

Abstract

While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, their performance often lags in lower-resource, non-English languages due to biases in the training data. In this work, we explore strategies for adapting the primarily English LLMs (Llama-2 and Llama-3) to Dutch, a language spoken by 30 million people worldwide yet often underrepresented in LLM development. We collect 104GB of Dutch text (32B tokens) from various sources to first apply continued pretraining using low-rank adaptation (LoRA), complemented with Dutch posttraining strategies provided by prior work. For Llama-2, we consider using (i) the tokenizer of the original model, and (ii) training a new, Dutch-specific tokenizer combined with embedding reinitialization. We evaluate our adapted models, ChocoLlama-2, both on standard benchmarks and a novel Dutch benchmark, ChocoLlama-Bench. Our results demonstrate that LoRA can effectively scale for language adaptation, and that tokenizer modification with careful weight reinitialization can improve performance. Notably, Llama-3 was released during the course of this project and, upon evaluation, demonstrated superior Dutch capabilities compared to our Dutch-adapted versions of Llama-2. We hence apply the same adaptation technique to Llama-3, using its original tokenizer. While our adaptation methods enhanced Llama-2's Dutch capabilities, we found limited gains when applying the same techniques to Llama-3. This suggests that for ever-improving, multilingual foundation models, language adaptation techniques may benefit more from focusing on language-specific posttraining rather than on continued pretraining. We hope this work contributes to the broader understanding of adapting LLMs to lower-resource languages, and to the development of Dutch LLMs in particular.
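
As a rough illustration of the low-rank adaptation (LoRA) mentioned above, the sketch below shows the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank update, giving W + (alpha / r) * B A. This is a minimal pure-Python sketch of the generic technique, not the training setup used in this work; the dimensions, the rank `r`, the scaling factor `alpha`, and the helper names are all illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(a)  # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_counts(d_out, d_in, r):
    """Trainable parameters for one matrix: full fine-tuning vs. LoRA adapters."""
    return d_out * d_in, r * (d_in + d_out)


# Hypothetical toy sizes, chosen only for illustration.
d_out, d_in, r, alpha = 6, 4, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero init

# With B initialised to zero, the adapted layer starts out identical to the base layer.
w_eff = lora_effective_weight(w, a, b, alpha)
```

Because only A and B are trained, the number of trainable parameters per matrix grows with r * (d_in + d_out) rather than d_in * d_out, which is what makes continued pretraining on a large corpus comparatively cheap.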
