Abstract
Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark -- a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, new benchmark datasets for various downstream tasks, and task-specific fine-tuned language models and discuss the performance of the models. Our dataset and models are available at the https://huggingface.co/EthioNLP repository.