Abstract
Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.
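
To make the contrast between the two sampling strategies concrete, the following is a minimal sketch, not the released implementation. It assumes that temperature-based sampling draws each language with probability proportional to its corpus share raised to the power 1/temperature, and that a capped-uniform scheme in the spirit of UniMax splits a fixed character budget as evenly as possible while limiting every language to a maximum number of passes over its corpus. The function names, the `budget` and `max_epochs` parameters, and the toy character counts are all hypothetical and chosen for illustration only.

```python
def temperature_sampling(char_counts, temperature):
    """Sampling probabilities proportional to (corpus share) ** (1 / temperature).

    Assumed formulation; temperature = 1 reproduces the natural distribution,
    larger values flatten it toward uniform.
    """
    total = sum(char_counts.values())
    weights = {
        lang: (n / total) ** (1.0 / temperature)
        for lang, n in char_counts.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


def capped_uniform_sampling(char_counts, budget, max_epochs):
    """Split `budget` characters as uniformly as possible across languages,
    while capping each language at `max_epochs` passes over its corpus.

    Hypothetical sketch of a UniMax-style allocation: languages are visited
    from smallest to largest so that budget a small language cannot absorb
    is redistributed to the remaining, larger ones.
    """
    remaining = budget
    allocation = {}
    langs = sorted(char_counts, key=char_counts.get)  # smallest corpus first
    for i, lang in enumerate(langs):
        fair_share = remaining / (len(langs) - i)  # even split of what is left
        cap = char_counts[lang] * max_epochs       # repeat limit for this corpus
        allocation[lang] = min(fair_share, cap)
        remaining -= allocation[lang]
    total = sum(allocation.values())
    return {lang: a / total for lang, a in allocation.items()}


# Toy character counts (illustrative only, not taken from mC4).
toy_counts = {"en": 1000, "de": 400, "sw": 10}
temperature_probs = temperature_sampling(toy_counts, temperature=3.0)
capped_probs = capped_uniform_sampling(toy_counts, budget=600, max_epochs=4)
```

In this toy setting the two larger languages receive equal shares under the capped scheme, while the smallest language is held to its repeat limit instead of being upsampled without bound.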