Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Abstract

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, "help" from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.
