Abstract
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET (multilingual adaptive gradient-based tokenization), which reduces over-segmentation via adaptive gradient-based subword tokenization. MAGNET learns to predict segment boundaries between byte tokens in a sequence via sub-modules within the model, which act as internal boundary predictors (tokenizers). Previous gradient-based tokenization methods aimed for uniform compression across sequences by integrating a single boundary predictor during training and optimizing it end-to-end through stochastic reparameterization alongside the next-token prediction objective. However, this approach still results in over-segmentation for non-Latin script languages in multilingual settings. In contrast, MAGNET offers a customizable architecture where byte-level sequences are routed through language-script-specific predictors, each optimized for its respective language script. This modularity enforces more equitable segmentation granularity across language scripts than previous methods. Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility.
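To make the routing idea concrete, the following is a minimal sketch rather than the MAGNET implementation. It assumes fixed byte strides as stand-ins for the learned boundary predictors, and the names `SCRIPT_STRIDE`, `SHARED_STRIDE`, `detect_script`, `shared_segments`, and `routed_segments` are hypothetical, as are the stride values. It only illustrates why a single byte-level compression rate may over-segment scripts whose characters take more UTF-8 bytes, and how script-specific rates can even out the segment counts.

```python
import unicodedata

# Hypothetical per-script targets (average bytes per segment).
# Illustrative values only; in the paper the predictors are learned.
SCRIPT_STRIDE = {"LATIN": 4, "CYRILLIC": 8, "DEVANAGARI": 12}
# Assumption: a single shared predictor with one uniform compression rate.
SHARED_STRIDE = 4


def detect_script(text):
    """Guess the dominant script from Unicode character names."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # The first word of a character name is typically its script,
            # e.g. 'LATIN SMALL LETTER H' -> 'LATIN'.
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"


def segment(byte_seq, stride):
    """Cut a byte sequence into fixed-size segments (may split a character)."""
    return [byte_seq[i:i + stride] for i in range(0, len(byte_seq), stride)]


def shared_segments(text):
    """Segment with one rate for every script."""
    return segment(text.encode("utf-8"), SHARED_STRIDE)


def routed_segments(text):
    """Route the byte sequence to the stride chosen for its script."""
    stride = SCRIPT_STRIDE.get(detect_script(text), SHARED_STRIDE)
    return segment(text.encode("utf-8"), stride)


if __name__ == "__main__":
    words = [
        "hello",                                   # Latin, 1 byte per character
        "\u043f\u0440\u0438\u0432\u0435\u0442",    # Cyrillic "privet", 2 bytes each
        "\u0928\u092e\u0938\u094d\u0924\u0947",    # Devanagari "namaste", 3 bytes each
    ]
    for word in words:
        print(detect_script(word),
              len(word.encode("utf-8")),
              len(shared_segments(word)),
              len(routed_segments(word)))
```

With the shared stride, these three words of similar length yield 2, 3, and 5 segments respectively, while the script-routed strides yield 2 segments each. This is the kind of disparity in segmentation granularity that script-specific predictors are meant to reduce.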