Abstract
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (\eg African languages) are often evaluated only on basic text classification tasks because appropriate or comprehensive benchmarks are lacking outside of high-resource languages. In this paper, we introduce IrokoBench, a human-translated benchmark dataset for 17 typologically diverse low-resource African languages covering three tasks: natural language inference~(AfriXNLI), mathematical reasoning~(AfriMGSM), and multi-choice knowledge-based question answering~(AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings~(where test sets are translated into English) across 10 open and six proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages~(such as English and French) and low-resource African languages. We also observe a significant performance gap between open and proprietary models: the highest-performing open model, Gemma 2 27B, reaches only 63\% of the performance of the best proprietary model, GPT-4o. In addition, machine-translating the test set into English before evaluation helped to close the gap for larger English-centric models, such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.