Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Abstract

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 languages, almost all of them low-resource. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, "help" from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.
