IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

  • 2024-06-05 16:23:08
  • David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Pontus Stenetorp

Abstract

Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks because appropriate or comprehensive benchmarks are lacking outside of high-resource languages. In this paper, we introduce IrokoBench, a human-translated benchmark dataset for 16 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant performance gap between open and proprietary models, with the highest-performing open model, Aya-101, reaching only 58% of the performance of the best proprietary model, GPT-4o. Machine translating the test set to English before evaluation helped to close the gap for larger English-centric models such as LLaMa 3 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.

 
