Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation

Abstract

Successful methods for unsupervised neural machine translation (UNMT) employ cross-lingual pretraining via self-supervision, often in the form of a masked language modeling or a sequence generation task, which requires the model to align the lexical- and high-level representations of the two languages. While cross-lingual pretraining works for similar languages with abundant corpora, it performs poorly in low-resource and distant languages. Previous research has shown that this is because the representations are not sufficiently aligned. In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings. Empirical results demonstrate improved performance both on UNMT (up to 4.5 BLEU) and bilingual lexicon induction using our method compared to a UNMT baseline.
