EthioMT: Parallel Corpus for Low-resource Ethiopian Languages

Abstract

Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT, a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
