Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

Abstract

For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusion of UMT data generation techniques in SSNMT has not been investigated. We show that integrating UMT techniques into SSNMT significantly outperforms SSNMT and UMT on all tested language pairs, with improvements of up to +4.3 BLEU, +50.8 BLEU, and +51.5 BLEU over SSNMT, statistical UMT, and hybrid UMT, respectively, on Afrikaans to English. We further show that the combination of multilingual denoising autoencoding, SSNMT with back-translation, and bilingual finetuning enables us to learn machine translation even for distant language pairs for which only small amounts of monolingual data are available, e.g. yielding a BLEU score of 11.6 (English to Swahili).
