One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Abstract

Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.
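
The win-rate comparison reported above can be illustrated with a minimal sketch. It assumes that a win rate is the share of pairwise comparisons a model wins, that ties count as non-wins, and that a gain is the difference between two win rates in percentage points; the abstract does not state these details, so all three are assumptions. The function name, label format, and data are hypothetical and are not results from the paper.

```python
from collections import Counter


def win_rate(judgments):
    """Return the share of pairwise comparisons won, in percent.

    `judgments` is a list of "win", "loss", or "tie" labels (hypothetical format).
    Ties count as non-wins here; this is an assumption, not the paper's stated rule.
    """
    if not judgments:
        raise ValueError("no judgments given")
    counts = Counter(judgments)
    return 100.0 * counts["win"] / len(judgments)


# Made-up labels for illustration only; these are not results from the paper.
universal = ["win", "win", "loss", "win", "tie"]   # model with a universal tokenizer
specific = ["win", "loss", "loss", "win", "tie"]   # model with a pretraining-specific tokenizer

# Gain expressed as a difference in percentage points (assumed convention).
gain = win_rate(universal) - win_rate(specific)
```

With these made-up labels the two win rates are 60% and 40%, so the gain is 20 percentage points.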
