Lugha-Llama: Adapting Large Language Models for African Languages

Abstract

Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. However, they often struggle to recognize low-resource languages, in particular African languages, which are not well represented in large training corpora. In this paper, we consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages. On the challenging IrokoBench dataset, our models consistently achieve the best performance amongst similarly sized baselines, particularly on knowledge-intensive multiple-choice questions (AfriMMLU). Additionally, on the cross-lingual question answering benchmark AfriQA, our models outperform the base model by over 10%. To better understand the role of English data during training, we translate a subset of 200M tokens into Swahili and perform an analysis, which reveals that the content of these data is primarily responsible for the strong performance. We release our models and data to encourage future research on African languages.
