Abstract
Existing work in multilingual pretraining has demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. However, much of this work relies only on the shared vocabulary and bilingual contexts to encourage correlation across languages, which is a loose and implicit way of aligning contextual representations between languages. In this paper, we plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. This effectively prevents the model from degenerating into predicting masked words conditioned only on the context in their own language. More importantly, when fine-tuning on downstream tasks, the cross-attention module can be plugged in or out on demand, thus naturally benefiting a wider range of cross-lingual tasks, from language understanding to generation. As a result, the proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For cross-lingual generation tasks, it also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1 to 2 BLEU.
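To make the plug-in/plug-out idea concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a simplified encoder layer with no learned projections, layer normalization, or feed-forward sublayer; the names `attend` and `encoder_layer` and the argument `other` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(x, y):
    """Element-wise residual sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def encoder_layer(x, other=None):
    """Hypothetical layer: self-attention, then optional cross-attention.

    x     : hidden states of one language (list of vectors).
    other : hidden states of the paired language, or None to plug the
            cross-attention module out (assumed monolingual fine-tuning).
    """
    h = add(x, attend(x, x, x))              # self-attention + residual
    if other is not None:                    # cross-attention plugged in
        h = add(h, attend(h, other, other))  # queries from x, keys/values from other
    return h


# Toy inputs: 2 source tokens and 3 target tokens, hidden width 2.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

plugged_out = encoder_layer(src)        # conditions on its own language only
plugged_in = encoder_layer(src, tgt)    # also conditions on the other language
```

With `other=None` the layer reduces to an ordinary self-attention encoder layer, which is the sense in which the module can be dropped for tasks that supply input in only one language; passing the paired sequence lets each position attend to the other language explicitly.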