The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Abstract

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.
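To make the evaluation setup concrete, the following is a minimal sketch of how per-language accuracy could be computed on a parallel four-choice dataset of this kind. The `Item` record, its field names, and the `predict` callback are hypothetical names introduced for illustration; they are not the dataset's actual schema or the evaluation code used in this work. With four answer options, a uniformly random guesser is expected to score 0.25, which is the natural baseline to compare against.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical layout, not the real schema)."""
    language: str                       # language-variant label, e.g. "eng_Latn"
    passage: str                        # short passage the question is based on
    question: str
    choices: Tuple[str, str, str, str]  # exactly four candidate answers
    answer: int                         # index (0..3) of the correct choice


def accuracy_by_language(
    items: Iterable[Item],
    predict: Callable[[Item], int],
) -> Dict[str, float]:
    """Return the fraction of correctly answered items for each language."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        total[item.language] += 1
        if predict(item) == item.answer:
            correct[item.language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```

Because the dataset is fully parallel, every language contains the same questions, so the per-language scores returned by a function like this can be compared directly without adjusting for differences in question difficulty.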
