How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Abstract

In this work we provide a systematic empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first establish if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for a performance difference. To disentangle the impacting variables, we train new monolingual models on the same data, but with different tokenizers, both the monolingual and the multilingual version. We find that while the pretraining data size is an important factor, the designated tokenizer of the monolingual model plays an equally important role in the downstream performance. Our results show that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
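
To make "adequately represented in the vocabulary" concrete, the sketch below computes one commonly used proxy: the average number of subword pieces a tokenizer produces per word. This is an illustrative assumption, not a procedure stated in the abstract. The greedy longest-match splitter is a simplified stand-in for a real subword tokenizer, and the two vocabularies (`MONO_VOCAB`, `MULTI_VOCAB`) and the sample text are hypothetical toy data.

```python
from typing import List, Set


def greedy_subword_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split a word by greedy longest-match against vocab (simplified, WordPiece-like)."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


def subwords_per_word(text: str, vocab: Set[str]) -> float:
    """Average number of subword pieces per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    total = sum(len(greedy_subword_tokenize(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical toy vocabularies, for illustration only.
MONO_VOCAB = {"talo", "ssa", "talossa", "kirja", "kirjasto"}
MULTI_VOCAB = {"ta", "lo", "ssa", "kir", "ja", "sto"}

SAMPLE = "talossa kirjasto"

if __name__ == "__main__":
    print("monolingual vocab: ", subwords_per_word(SAMPLE, MONO_VOCAB))
    print("multilingual vocab:", subwords_per_word(SAMPLE, MULTI_VOCAB))
```

Under these toy assumptions the dedicated vocabulary keeps each word whole, while the shared vocabulary splits each word into three pieces. A value close to 1 suggests the language is well covered by the vocabulary; larger values suggest heavier fragmentation.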
