SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

Abstract

Despite the progress recorded in the last few years in multilingual natural language processing, evaluation is typically limited to a small set of languages with available datasets, which excludes a large number of low-resource languages. In this paper, we created SIB-200, a large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects, to address the lack of evaluation datasets for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on the Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 203 languages covered in the corpus. Despite the simplicity of this task, our evaluation in the fully supervised setting, the cross-lingual transfer setting, and the large language model prompting setting shows that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, under-represented language families (like Nilotic and Atlantic-Congo), and languages from the regions of Africa, the Americas, Oceania, and South East Asia often have the lowest performance on our topic classification dataset. We hope our dataset will encourage a more inclusive evaluation of multilingual language models on a more diverse set of languages. https://github.com/dadelani/sib-200
