Abstract
Achieving universal translation between all human language pairs is the holy grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to reaching this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-English-centric language pairs is forbiddingly limited. To this end, we present a pragmatic approach towards building a multilingual MT model that covers hundreds of languages, using a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting, even surpassing supervised translation quality for low- and mid-resource languages. We conduct a wide array of experiments to understand the effect of the degree of multilingual supervision, domain mismatches and amounts of parallel and monolingual data on the quality of our self-supervised multilingual models. To demonstrate the scalability of the approach, we train models with over 200 languages and demonstrate high performance on zero-resource translation on several previously under-studied languages. We hope our findings will serve as a stepping stone towards enabling translation for the next thousand languages.