Toucan: Many-to-Many Translation for 150 African Language Pairs

  • 2024-07-12 18:13:47
  • AbdelRahim Elmadany, Ife Adebara, Muhammad Abdul-Mageed

Abstract

We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune the aforementioned models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT, tailored for evaluating machine translation. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.

 
