Abstract
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically diverse low-resource African languages covering three tasks: natural language inference~(AfriXNLI), mathematical reasoning~(AfriMGSM), and multi-choice knowledge-based QA~(AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings~(where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages~(such as English and French) and low-resource African languages. We also observe a significant performance gap between open and proprietary models, with the highest-performing open model, Aya-101, reaching only 58\% of the performance of the best-performing proprietary model, GPT-4o. Machine-translating the test set into English before evaluation helped to close the gap for larger English-centric models, such as LLaMa 3 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.