Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT

Abstract

Using a language model (LM) pretrained on two languages with large monolingual data in order to initialize an unsupervised neural machine translation (UNMT) system yields state-of-the-art results. When limited data is available for one language, however, this method leads to poor translations. We present an effective approach that reuses an LM that is pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. To reuse the pretrained LM, we have to modify its predefined vocabulary, to account for the new language. We therefore propose a novel vocabulary extension method. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq), yielding more than +8.3 BLEU points for all four translation directions.
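The abstract does not spell out the vocabulary extension step, so the following is only a minimal sketch of the general idea, not the RE-LM implementation. It assumes that subword tokens already known to the pretrained LM keep their ids and embedding rows, and that tokens seen only in the new language are appended with randomly initialized rows. The function name `extend_vocabulary`, its parameters, and the Gaussian initialization are hypothetical choices made for illustration.

```python
import random


def extend_vocabulary(vocab, embeddings, new_tokens, seed=0, scale=0.02):
    """Append unseen tokens to a vocabulary and add one embedding row for each.

    vocab:      dict mapping token -> row index in the pretrained LM
    embeddings: list of rows (lists of floats), one row per vocab entry
    new_tokens: iterable of subword tokens found in the new language
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    ext_vocab = dict(vocab)                          # pretrained ids stay unchanged
    ext_embeddings = [row[:] for row in embeddings]  # pretrained rows are copied as-is
    for token in new_tokens:
        if token in ext_vocab:
            continue                                 # shared subword: reuse pretrained row
        ext_vocab[token] = len(ext_embeddings)
        # Assumption: new rows start as small Gaussian noise.
        ext_embeddings.append([rng.gauss(0.0, scale) for _ in range(dim)])
    return ext_vocab, ext_embeddings


# Toy usage: a 3-token "pretrained" vocabulary extended with two new subwords.
pretrained_vocab = {"<unk>": 0, "the": 1, "ing": 2}
pretrained_emb = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
vocab, emb = extend_vocabulary(pretrained_vocab, pretrained_emb, ["ing", "që", "ата"])
```

In this sketch the pretrained rows are untouched, so fine-tuning on both languages would start from the original LM weights and only the appended rows would be learned from scratch.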
