No Language Left Behind: Scaling Human-Centered Machine Translation

Abstract

Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200-language barrier while ensuring safe, high-quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low- and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.
