Abstract
Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter than those produced by XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one-million-token vocabulary. XLM-V outperforms XLM-R on every task we tested, ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).