Abstract
A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained on a multilingual corpus. Because a skewed data distribution is considered harmful, a sampling strategy is usually used to balance the languages in the corpus. However, few works have systematically examined how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that, while relatively better performance is often observed when languages are more equally sampled, downstream performance is more robust to language imbalance than commonly expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before the task is performed. We also distinguish language sampling for tokenizer training from sampling for model training and show that the model is more sensitive to the latter.
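As a rough illustration of the two warning features named above, the following sketch computes an UNK rate and a simple proxy for closeness to the character level from already-tokenized text. The function names `unk_rate` and `mean_token_length`, and the use of mean characters per token as the closeness proxy, are assumptions made for illustration; they are not necessarily the definitions used in this work.

```python
from typing import Iterable, List, Set


def unk_rate(tokenized: Iterable[List[str]], vocab: Set[str]) -> float:
    """Fraction of tokens that are not in the tokenizer vocabulary.

    Hypothetical helper: assumes out-of-vocabulary tokens are what the
    tokenizer would map to UNK.
    """
    total = 0
    unknown = 0
    for sentence in tokenized:
        for token in sentence:
            total += 1
            if token not in vocab:
                unknown += 1
    return unknown / total if total else 0.0


def mean_token_length(tokenized: Iterable[List[str]]) -> float:
    """Average number of characters per token.

    Assumed proxy for closeness to the character level: a value near 1.0
    means the segmentation is almost character by character.
    """
    lengths = [len(token) for sentence in tokenized for token in sentence]
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    sentences = [["he", "llo"], ["w", "o", "r", "l", "d"]]
    vocabulary = {"he", "llo", "w", "o", "r", "l"}
    print(f"UNK rate: {unk_rate(sentences, vocabulary):.3f}")
    print(f"Mean token length: {mean_token_length(sentences):.3f}")
```

Computed per language on held-out text, a high UNK rate or a mean token length close to 1 would flag a language that the tokenizer covers poorly, before any translation model is trained.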