Abstract
A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address these disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, because morpheme inventories are more balanced across languages than the character inventories used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap across diverse languages.