AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages

Abstract

Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine translation, there are no standardized reproducible benchmarks for many African languages, many of which are used by millions of speakers but have less digitized textual data. To tackle these challenges, we propose AfroMT, a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We also develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. Furthermore, we explore the newly considered case of low-resource focused pretraining and develop two novel data augmentation-based strategies, leveraging word-level alignment information and pseudo-monolingual data for pretraining multilingual sequence-to-sequence models. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines. We also show gains of up to 12 BLEU points over cross-lingual transfer baselines in data-constrained scenarios. All code and pretrained models will be released as further steps towards larger reproducible benchmarks for African languages.
