SERENGETI: Massively Multilingual Language Models for Africa

Abstract

Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving 82.27 average F_1. We also perform analyses of errors from our models, which allow us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research at https://github.com/UBC-NLP/serengeti.
