Cross-Lingual Natural Language Generation via Pre-Training

  • 2019-11-22 09:24:46
  • Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, Heyan Huang


In this work we focus on transferring supervision signals of natural language generation (NLG) tasks between multiple languages. We propose to pretrain the encoder and the decoder of a sequence-to-sequence model under both monolingual and cross-lingual settings. The pre-training objective encourages the model to represent different languages in the shared space, so that we can conduct zero-shot cross-lingual transfer. After the pre-training procedure, we use monolingual data to fine-tune the pre-trained model on downstream NLG tasks. Then the sequence-to-sequence model trained in a single language can be directly evaluated beyond that language (i.e., accepting multi-lingual input and producing multi-lingual output). Experimental results on question generation and abstractive summarization show that our model outperforms the machine-translation-based pipeline methods for zero-shot cross-lingual generation. Moreover, cross-lingual transfer improves NLG performance of low-resource languages by leveraging rich-resource language data. Our implementation and data are available at



Zewen Chi,  Li Dong,  Furu Wei,  Wenhui Wang,  Xian-Ling Mao,  Heyan Huang
Beijing Institute of Technology
Microsoft Research
  Zewen Chi: contribution during internship at Microsoft Research.



Copyright © 2020, Association for the Advancement of Artificial Intelligence. All rights reserved.

1 Introduction

Learning natural language generation (NLG) models heavily relies on annotated training data. However, most available datasets are collected in a single language (typically English), which restricts deploying the applications to other languages. In this work, we aim at transferring the supervision of a monolingual NLG dataset to unseen languages, so that we can boost performance for the low-resource settings.

Figure 1: We use a monolingual (such as English) NLG dataset to fine-tune the pre-trained model Xnlg, and then evaluate it beyond the language for both source and target sides (e.g., Chinese, and French).

Various methods have been proposed over the years to learn cross-lingual word embeddings (???) or sentence encoders (???), which try to encode multilingual texts into a shared vector space. Despite promising results on cross-lingual classification problems, cross-lingual pre-trained models for NLG tasks remain relatively understudied.

The cross-lingual generation problem is challenging for the following reasons. First, it requires the model to understand multilingual input texts and to generate multilingual target sequences, so both the encoder and the decoder should be pre-trained together. Second, because cross-lingual NLG is many-to-many, the number of language pairs grows with the square of the number of languages. Third, the prediction space of cross-lingual NLG is much larger than that of classification tasks, which makes knowledge transfer of decoders critical.

Previous work mainly relies on machine translation (MT) to map texts to different languages. The first strand of research directly uses MT in a pipeline manner (?). For example, inputs written in other languages are first translated to English and fed into an NLG model trained on English data; the generated English texts are then translated back to the target language. Another strand of work uses MT to generate pseudo training data for language pairs that lack annotations (??). However, such methods have to use multiple MT systems, which makes them suffer from error propagation. Moreover, because the pipeline-based methods do not explicitly share the same parameter space across languages, we cannot directly transfer the task-specific supervision to other low-resource languages.

In this paper, we propose a cross-lingual pre-trained model (named Xnlg) to transfer monolingual NLG supervision to other pre-trained languages by fine-tuning. Specifically, Xnlg shares the same sequence-to-sequence model across languages, and is pre-trained with both monolingual and cross-lingual objectives. The model not only learns to understand multilingual input, but is also able to generate in a specific language by conditioning on the encoded semantics. Figure 1 demonstrates how to use Xnlg to perform cross-lingual transfer for downstream tasks. The proposed model enables us to fine-tune the pre-trained model on monolingual NLG training data, and then evaluate it beyond a single language, including zero-shot cross-lingual generation. Besides, we explore several fine-tuning strategies to balance cross-lingual ability against task ability. In addition, we introduce two cross-lingual NLG datasets (i.e., question generation and abstractive summarization) for evaluation, which include three languages, namely English, Chinese, and French. Experimental results on the NLG tasks show that Xnlg achieves competitive performance compared with the machine-translation-based pipeline model in zero-shot cross-lingual settings.

2 Related Work

Cross-Lingual NLG

Several previous methods have been proposed for cross-lingual abstractive summarization. ? (?) and ? (?) use translated documents or summaries as pseudo training data. ? (?) incorporate monolingual summarization and machine translation to improve cross-lingual summarization. However, these systems only conduct experiments that generate summaries in a language different from the input language, rather than transferring supervision signals across all language pairs. ? (?) use training data annotated in multiple languages to jointly train a sequence-to-sequence model for question generation. In contrast, our method can also be applied to zero-shot settings across languages.

Monolingual Pre-Training

Various training objectives are designed to pretrain text encoders used for general-purpose representations, such as language modeling (?????), auto-encoding (?), and machine translation (?). Apart from pre-training encoders, several pre-trained models (??) are proposed for generation tasks. In comparison, our goal is to investigate a pre-training method for cross-lingual NLG tasks.

Cross-Lingual Pre-Training

When BERT (?) is pre-trained on corpora of multiple languages, it shows a surprising ability to produce cross-lingual representations (?). More recently, ? (?) extend masked language modeling pre-training to cross-lingual settings, which shows significant improvements on cross-lingual classification and unsupervised machine translation. By comparison, we pretrain both the encoder and the decoder for cross-lingual generation tasks, rather than focusing only on the encoder. ? (?) use the sequence encoder of the multilingual translation model (?) to produce cross-lingual sentence embeddings. However, as shown in the experiments (Section 4), it is difficult to control the target language by directly fine-tuning the pre-trained translation model on downstream NLG tasks.

3 Methods

Figure 2: Overview of the pre-training tasks and the pre-training protocol designed for Xnlg.

As shown in Figure 2, Xnlg is a pre-trained sequence-to-sequence model, which is based on Transformer (?). Both the encoder and the decoder are supposed to support multiple languages. Following (?), we use language tag embeddings to distinguish the source and target languages. Given a sentence and its corresponding language tag, Xnlg encodes the input into vector representations. By conditioning on the encoding vectors and a specific language tag, the decoder generates the output sequence in the target language.

3.1 Pre-Training Tasks

Monolingual MLM

The masked language modeling (MLM) (?) task aims at predicting the randomly masked words according to their context. The objective pretrains the bidirectional encoder to obtain contextual representations. Following (?), we randomly mask 15% of the tokens in a monolingual sentence. For each masked token, we substitute it with a special token [M], a random token, or the unchanged token with probabilities of 0.8, 0.1, and 0.1, respectively. Let $x$ denote a sentence from the monolingual training corpus, and $M_x$ the set of randomly masked positions. The monolingual MLM loss is defined as:

$$\mathcal{L}_{\mathrm{MLM}}^{(x)} = -\sum_{i \in M_x} \log p(x_i \mid x_{\setminus M_x}) \qquad (1)$$

where $x_{\setminus M_x}$ is the masked version of input $x$. The language tags are fed into the model for all pre-training tasks.
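The masking rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mask_for_mlm`, the toy vocabulary, and the choice of sampling each position independently with probability 0.15 (rather than masking exactly 15% of the tokens) are not taken from the paper.

```python
import random

MASK_TOKEN = "[M]"  # the special mask token named in the text


def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, masked_positions) using the 0.8/0.1/0.1 rule.

    Assumption: each position is selected independently with `mask_prob`.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    masked_positions = []
    for i in range(len(tokens)):
        if rng.random() >= mask_prob:
            continue  # position is not selected for prediction
        masked_positions.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN         # replace with [M]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # replace with a random token
        # otherwise: keep the original token unchanged
    return corrupted, masked_positions
```

The loss in Equation (1) is then summed only over `masked_positions`, including the positions whose token was left unchanged.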

Denoising Auto-Encoding (DAE)

We use the denoising auto-encoding (DAE) objective (?) to pretrain the encoder-decoder attention mechanism. Given sentence $x$ from the monolingual corpus, we use three types of noise to obtain the randomly perturbed text $\hat{x}$. First, the word order is locally shuffled. Second, we randomly drop tokens of the sentence with a probability of 0.1. Third, we substitute tokens with the special padding token [P] with a probability of 0.1. The pre-training objective is to recover the original sentence $x$ by conditioning on $\hat{x}$. The DAE loss is computed via:

$$\mathcal{L}_{\mathrm{DAE}}^{(x)} = -\log p(x \mid \hat{x}) = -\sum_{i=1}^{|x|} \log p(x_i \mid \hat{x}, x_{<i}) \qquad (2)$$

where $x_{<i} = x_1, \cdots, x_{i-1}$.
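A minimal sketch of the three noise types follows. The text does not specify how the local shuffle is implemented; the version below assumes the common approach of adding a small random offset to each index and sorting, with a hypothetical window of 3. The function name `add_dae_noise` and its parameters are illustrative only.

```python
import random

PAD_TOKEN = "[P]"  # the special padding token named in the text


def add_dae_noise(tokens, shuffle_window=3, drop_prob=0.1, pad_prob=0.1, rng=None):
    """Perturb a token list with local shuffling, dropping, and [P] substitution."""
    rng = rng or random.Random()
    # 1) Local shuffle (assumed scheme): jitter each index by a random offset
    #    in [0, shuffle_window] and sort, so tokens move only a few positions.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: keys[i])
    shuffled = [tokens[i] for i in order]
    # 2) Drop each token with probability drop_prob.
    kept = [tok for tok in shuffled if rng.random() >= drop_prob]
    # 3) Replace each remaining token with [P] with probability pad_prob.
    return [PAD_TOKEN if rng.random() < pad_prob else tok for tok in kept]
```

The decoder is then trained to reproduce the original `tokens` from the returned sequence, as in Equation (2).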

Cross-Lingual MLM (XMLM)

Similar to monolingual MLM, the masked token prediction task can be extended to cross-lingual settings (?). To be specific, given a parallel corpus, we concatenate the pair of bilingual sentences $(x, y)$ into a single sequence, and use it as the input of MLM. The language tags are also fed into the model to indicate the languages of tokens. During training, we adopt the same masking strategy as monolingual MLM. Apart from using monolingual context to predict the masked tokens, XMLM encourages the model to utilize the alignment of bilingual sentences, so that the model learns to map cross-lingual texts into a shared vector space. Similar to Equation (1), the cross-lingual MLM loss is:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{XMLM}}^{(x,y)} = &-\sum_{i \in M_x} \log p(x_i \mid x_{\setminus M_x}, y_{\setminus M_y}) && (3) \\
&-\sum_{i \in M_y} \log p(y_i \mid x_{\setminus M_x}, y_{\setminus M_y}) && (4)
\end{aligned}
$$

where $M_x$ and $M_y$ represent the masked positions of $x$ and $y$, respectively.
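To make the concatenation and the two sums of the XMLM loss concrete, here is a small sketch. The helper names `build_xmlm_input` and `xmlm_loss` are hypothetical, and `log_probs` stands in for whatever per-position log-probabilities a model would assign to the correct tokens; no model is involved.

```python
import math


def build_xmlm_input(x_tokens, y_tokens, x_lang, y_lang):
    """Concatenate a bilingual pair and attach one language tag per token."""
    tokens = list(x_tokens) + list(y_tokens)
    lang_tags = [x_lang] * len(x_tokens) + [y_lang] * len(y_tokens)
    return tokens, lang_tags


def xmlm_loss(log_probs, masked_positions, x_len):
    """Sum negative log-probabilities over masked positions in x and in y.

    log_probs[i] is assumed to be log p(correct token at position i) given
    the masked concatenated sequence.
    """
    loss_x = -sum(log_probs[i] for i in masked_positions if i < x_len)
    loss_y = -sum(log_probs[i] for i in masked_positions if i >= x_len)
    return loss_x + loss_y
```

For example, with `log_probs = [math.log(0.5)] * 5`, `masked_positions = [1, 3]`, and `x_len = 3`, one masked position falls in each sentence and the loss is `2 * math.log(2)`.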

Cross-Lingual Auto-Encoding (XAE)

If only DAE is used as the pre-training task for the decoder, we find that the model ignores the target language tag and simply generates in the same language as the input, which is caused by the spurious correlation issue (?). In other words, the DAE loss captures the spurious correlation between the source language tag and the target sentences, whereas we expect the language of the generated sentences to be controlled by the target language tag. To solve this problem, we use machine translation as the cross-lingual auto-encoding (XAE) task, which decreases mutual information between the target sentences and the source language tag. XAE can be viewed as the multilingual version of the DAE task in the sense that both of them recover the sentence by conditioning on the encoded representations. The cross-lingual auto-encoding loss is:

$$\mathcal{L}_{\mathrm{XAE}}^{(x,y)} = -\log p(y \mid x) - \log p(x \mid y) \qquad (5)$$

where $(x, y)$ is a pair of sentences in the parallel corpus.
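The difference between the two decoder objectives can be seen in how training examples would be laid out. The sketch below is illustrative only; the dictionary keys and the function names `dae_example` and `xae_examples` are assumptions, not the authors' data format.

```python
def dae_example(noisy_tokens, tokens, lang):
    """DAE: recover `tokens` from `noisy_tokens`; both tags name the same language."""
    return {"source": noisy_tokens, "source_lang": lang,
            "target": tokens, "target_lang": lang}


def xae_examples(x_tokens, y_tokens, x_lang, y_lang):
    """XAE: one parallel pair yields both directions, x->y and y->x."""
    return [
        {"source": x_tokens, "source_lang": x_lang,
         "target": y_tokens, "target_lang": y_lang},
        {"source": y_tokens, "source_lang": y_lang,
         "target": x_tokens, "target_lang": x_lang},
    ]
```

In every DAE example the source and target tags coincide, so a decoder trained on DAE alone can predict the output language from the source tag. XAE examples break that correlation because the two tags differ.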

3.2 Pre-Training Protocol

As shown in Figure 2(b), we propose a two-stage pre-training protocol for Xnlg. The first stage pretrains the encoding components, where the model learns to encode multilingual sentences to a shared embedding space. We consider using MLM and XMLM as the pre-training tasks. The objective of the first stage is to minimize:

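$$\mathcal{L}_1 = \sum_{(x,y) \in \mathcal{D}_p} \mathcal{L}_{\mathrm{XMLM}}^{(x,y)} + \sum_{x \in \mathcal{D}_m} \mathcal{L}_{\mathrm{MLM}}^{(x)} \qquad (6)$$

where $\mathcal{D}_p$ indicates the parallel corpus, and $\mathcal{D}_m$ is the monolingual corpus.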

Although the encoder pre-trained in the first stage enables the model to encode multilingual sentences, it cannot be used directly for cross-lingual NLG because: 1) the encoder-decoder attention is not pre-trained; 2) masked language modeling and autoregressive decoding use different decoding algorithms, resulting in a mismatch between pre-training and fine-tuning. Therefore, we conduct decoding pre-training in the second stage, using DAE and XAE as the tasks. In this stage, we only update the decoder parameters and keep the encoder fixed. The objective of the second stage is to minimize:

$$\mathcal{L}_2 = \sum_{(x,y) \in \mathcal{D}_p} \mathcal{L}_{\mathrm{XAE}}^{(x,y)} + \sum_{x \in \mathcal{D}_m} \mathcal{L}_{\mathrm{DAE}}^{(x)} \qquad (7)$$

3.3 Fine-Tuning on Downstream NLG Tasks

In the fine-tuning procedure, let us assume that we only have English training data for downstream NLG tasks. According to whether the target language is English, the directions of NLG can be categorized into two classes: any language to non-English languages (Any-to-Others), and any language to English (Any-to-English).

Fine-Tuning for Any-to-Others NLG

Ideally, the model can be fine-tuned towards a new task without losing its cross-lingual ability. However, we observe catastrophic forgetting of target language controllability if we fine-tune all the model parameters for Any-to-Others NLG. So we keep the decoder and the word embeddings frozen and only update the encoder parameters during fine-tuning. In practice, we find that this fine-tuning method prevents the model from decoding only English words in the Any-to-Others setting.

Fine-Tuning for Any-to-English NLG

For the Any-to-English NLG transfer, the decoder always generates English. So we can freeze the encoder parameters, and update the decoder parameters to retain the cross-lingual ability. As an alternative way, we can also fine-tune all the parameters to obtain the best results on the English dataset while having a slight drop in performance.

| Data | Models | BL-4 | MTR | RG-L |
|---|---|---|---|---|
| En-QG | CorefNqg | 15.16 | 19.12 | - |
| En-QG | Mp-Gsn | 16.38 | 20.25 | 44.48 |
| En-QG | Xlm | 16.94 | 21.87 | 46.45 |
| En-QG | Xnlg | 19.99 | 24.05 | 48.74 |
| Zh-QG | Xlm | 23.41 | 23.32 | 47.40 |
| Zh-QG | Xnlg | 24.89 | 24.53 | 49.72 |

Table 1: Evaluation results of monolingual supervised question generation for English and Chinese. BL is short for BLEU, MTR for METEOR, and RG for ROUGE. The results with “” are reported on different data splits.

| Models | BL-4 | MTR | RG-L |
|---|---|---|---|
| Xlm | 0.25 | 0.62 | 2.56 |
| Pipeline (Xlm) | 4.42 | 9.59 | 21.22 |
| Pipeline (Xlm) w/ Google Translator | 9.95 | 14.92 | 29.37 |
| Xnlg | 16.37 | 18.74 | 34.93 |

Table 2: Evaluation results of zero-shot Chinese-Chinese question generation. Same shorthands apply as in Table 1.

4 Experiments

We conduct experiments over two cross-lingual NLG downstream tasks, i.e., cross-lingual question generation, and cross-lingual abstractive summarization. We compare Xnlg with state-of-the-art cross-lingual pre-trained models, and machine-translation-based pipelines.

| Models | Rel | Flu | Corr |
|---|---|---|---|
| Xlm | 0 | 0 | 0 |
| Pipeline (Xlm) | 0.50 | 0.80 | 0.03 |
| Pipeline (Xlm) w/ Google Translator | 1.31 | 1.43* | 0.69 |
| Xnlg | 1.68* | 1.29 | 0.89* |

Table 3: Human evaluation results of zero-shot Chinese-Chinese question generation. Rel is short for relatedness, Flu for fluency, and Corr for correctness. “*” indicates the improvements are significant at p<0.05.

| Models | Rel | Flu | Corr |
|---|---|---|---|
| Xlm | 0 | 0 | 0 |
| Pipeline (Xlm) | 0.87 | 0.86 | 0.28 |
| Xnlg | 1.09* | 0.95 | 0.53* |

Table 4: Human evaluation results of zero-shot English-Chinese question generation. “*” indicates the improvements are significant at p<0.05. Same shorthands apply as in Table 3.

| Models | Rel | Flu | Corr |
|---|---|---|---|
| Xlm | 1.00 | 1.20 | 0.40 |
| Pipeline (Xlm) | 0.85 | 0.98 | 0.28 |
| Xnlg | 1.24* | 1.47* | 0.76* |

Table 5: Human evaluation results of zero-shot Chinese-English question generation. “*” indicates the improvements are significant at p<0.05. Same shorthands apply as in Table 3.

| Data | Models | RG-1 | RG-2 | RG-L |
|---|---|---|---|---|
| En-AS | Xlm | 48.15 | 26.35 | 45.04 |
| En-AS | Xnlg | 48.76 | 26.82 | 45.57 |
| Fr-AS | Xlm | 56.27 | 39.20 | 52.84 |
| Fr-AS | Xnlg | 57.84 | 40.81 | 54.24 |
| Zh-AS | Xlm | 55.30 | 42.57 | 52.95 |
| Zh-AS | Xnlg | 57.65 | 44.93 | 54.95 |

Table 6: Evaluation results of supervised monolingual summarization. Same shorthands apply as in Table 1.

| Models | RG-1 | RG-2 | RG-L |
|---|---|---|---|
| Xlm | 14.53 | 1.80 | 13.43 |
| Pipeline (Xlm) | 30.58 | 12.01 | 27.44 |
| Pipeline (Xlm) w/ Google Translator | 38.48 | 18.86 | 34.98 |
| Xnlg | 39.98 | 20.31 | 36.31 |

Table 7: Evaluation results of zero-shot French abstractive summarization. Same shorthands apply as in Table 1.

| Models | RG-1 | RG-2 | RG-L |
|---|---|---|---|
| Xlm | 0.71 | 0.28 | 0.70 |
| Pipeline (Xlm) | 26.39 | 13.11 | 23.98 |
| Pipeline (Xlm) w/ Google Translator | 36.96 | 22.03 | 33.99 |
| Xnlg | 41.66 | 28.70 | 38.91 |

Table 8: Evaluation results of zero-shot Chinese abstractive summarization. Same shorthands apply as in Table 1.

4.1 Training Details


We use a pre-trained Xnlg with a 10-layer encoder and a 6-layer decoder. For every Transformer layer, we use 1024 hidden units, 8 attention heads, and GELU activations (?). In the first pre-training stage, we directly use the 15-language pre-trained XLM (?) to initialize the parameters of our encoder and decoder. In the second stage, we use Wikipedia as the monolingual data for the DAE objective, and MultiUN (?) as the parallel data for the XAE objective. The DAE loss is trained with a weight of 0.5. We train a two-language (English/Chinese) and a three-language (English/French/Chinese) Xnlg for the two downstream NLG tasks, respectively. Following (?), we use the tokenizer provided by (?) for Chinese, and Moses for other languages. Then the words in all languages are split with a shared subword vocabulary learned by BPE (?). We use the Adam optimizer with a linear warm-up over the first 4,000 steps and linear decay for later steps, and the learning rate is set to $10^{-4}$. The pre-training batch size is 64, and the sequence length is set to 256. It takes about 30 hours to run 23,000 steps of the pre-training procedure on 4 NVIDIA Tesla V100-16GB GPUs.
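The learning-rate schedule described above (linear warm-up over 4,000 steps to a peak of 10^-4, then linear decay) can be sketched as a plain function of the step number. The text does not say where the decay ends; this sketch assumes the rate reaches zero at step 23,000, the reported length of the run. The name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=4000, total_steps=23000):
    """Linear warm-up to peak_lr, then linear decay.

    Assumption: the decay reaches zero at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate is half the peak at step 2,000, equals the peak at step 4,000, and is back to half the peak at step 13,500.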


For fine-tuning on downstream NLG tasks, we use the Adam optimizer with a learning rate of $5 \times 10^{-6}$. We set the batch size to 16 and 32 for question generation and abstractive summarization, respectively. When the target language is the same as the language of the training data, we fine-tune all parameters. When the target language is different from the language of the training data, we fine-tune only the Transformer layers of the encoder. We truncate the input sentences to the first 256 tokens. During decoding, we use beam search with a beam size of 3, and limit the length of the target sequence to 80 tokens.
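For readers unfamiliar with the decoding setup, the following is a toy beam search with a beam size of 3 and a length limit of 80 tokens. It is a generic sketch, not the authors' decoder: `next_log_probs` is a hypothetical callback that returns a dictionary mapping each candidate next token to its log-probability given the prefix, and no length normalization is applied (the text does not say whether one was used).

```python
import math


def beam_search(next_log_probs, bos, eos, beam_size=3, max_len=80):
    """Return the highest-scoring sequence found by beam search."""
    beams = [([bos], 0.0)]  # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]
```

A usage example with a toy distribution that always ends after two tokens:

```python
def toy_model(seq):
    if len(seq) >= 3:
        return {"</s>": 0.0}
    return {"a": math.log(0.6), "b": math.log(0.4)}

best = beam_search(toy_model, "<s>", "</s>")
```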

4.2 Question Generation

We evaluate our model on zero-shot cross-lingual answer-aware question generation (QG). Given a passage and an expected answer, the goal is to generate a question that asks about that answer. In the following experiments, we extend the QG task to the cross-lingual setting. Using only English QG training data, our goal is to generate questions in English or Chinese for a given passage-answer pair in English or Chinese.

We use SQuAD 1.1 (?) as the English QG dataset. It is a popular English question answering dataset containing over 100,000 questions and their corresponding annotated passages. Following (?), we regard the original development set as the test set, and sample 5000 examples from the training data of two datasets as the development sets. For Chinese QG, we follow the default data splits of WebQA (?). We regard the provided annotated evidence sentences as the input passages instead of entire documents. To construct the input sequence, we view the whole input passage as a single sentence, and concatenate the passage and the answer into one sequence with a special token [S] between them. When decoding Chinese, we restrict the output to a subset of the vocabulary, which is obtained from the passage sentences of the WebQA dataset.
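The input construction described above can be written as a one-line concatenation. The sketch assumes that the 256-token truncation mentioned in the training details is applied to the concatenated sequence; the text does not say whether the passage or the full sequence is truncated, and the function name `build_qg_input` is hypothetical.

```python
SEP_TOKEN = "[S]"  # the special separator token named in the text


def build_qg_input(passage_tokens, answer_tokens, max_len=256):
    """Concatenate passage and answer with [S] between them, then truncate.

    Assumption: truncation applies to the whole concatenated sequence.
    """
    sequence = list(passage_tokens) + [SEP_TOKEN] + list(answer_tokens)
    return sequence[:max_len]
```

Note that under this assumption a very long passage would push the answer out of the input, so an implementation might prefer to truncate the passage alone.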

English-English Question Generation

We first conduct experiments on the supervised English-English QG setting. We compare our model to the following baselines:

  • CorefNqg (?) An attentional sequence-to-sequence model with a feature-rich encoder.

  • Mp-Gsn (?) A sequence-to-sequence model with self-attention and maxout pointer mechanism.

  • Xlm (?) State-of-the-art cross-lingual pre-trained Transformer. We initialize the sequence-to-sequence model with pre-trained XLM.

We evaluate models with BLEU-4 (BL-4), ROUGE (RG) and METEOR (MTR) metrics. As shown in Table 1, Xnlg outperforms the baselines, which demonstrates that our pre-trained model provides a good initialization for NLG.

Chinese-Chinese Question Generation

We conduct experiments on zero-shot Chinese-Chinese QG to evaluate the cross-lingual transfer ability. In this task, models are trained with English QG data but evaluated with Chinese QG examples. We include the following models as our baselines:

  • Xlm Fine-tuning XLM with the English QG data.

  • Pipeline (Xlm) The pipeline of first translating the input Chinese sentences into English, then performing En-En-QG with the XLM model, and finally translating the output back to Chinese. We use the Transformer as the translator, which is also trained on the MultiUN dataset.

  • Pipeline (Xlm) with Google Translator Utilizing Google Translator in Pipeline (Xlm) for translation.

We evaluate models with both automatic evaluation metrics and human experts. The automatic metric scores are computed by regarding each Chinese character as a token. For human evaluation, we consider three metrics: relatedness, fluency, and correctness, each scored as an integer ranging from 1 to 3. We randomly select 100 passage-answer pairs from the English QG test set, and use the models to generate questions. Then we present these examples to three experts and ask for the above scores. In Table 2 and Table 3, we present the results for zero-shot Zh-Zh-QG. The results of monolingual supervised models are also reported in Table 1 for reference. In the automatic evaluation, our model consistently performs better than the baselines in both the zero-shot and the monolingual supervised settings. In the human evaluation, our model also obtains significant improvements in terms of relatedness and correctness.
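As an illustration of scoring with each Chinese character treated as a token, here is a small character-level ROUGE-L computation based on the longest common subsequence. It is a simplified sketch: it uses a plain F1 of LCS precision and recall, which may differ from the exact ROUGE-L variant and toolkit used in the experiments, and the function names are hypothetical.

```python
def char_tokens(text):
    """Split text into characters, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Character-level ROUGE-L as the F1 of LCS precision and recall."""
    c, r = char_tokens(candidate), char_tokens(reference)
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For instance, a five-character candidate that matches five of the six characters of the reference in order has precision 1 and recall 5/6, giving an F1 of 10/11.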

English-Chinese Question Generation

In the zero-shot English-Chinese question generation experiments, we use Xlm and Pipeline (Xlm) as our baselines. Pipeline (Xlm) is a pipeline method that uses En-En-QG with Xlm to generate questions, and then translates the results to Chinese. Because there are no annotations for En-Zh-QG, we perform human evaluation studies for this setting. Table 4 shows the human evaluation results, where our model surpasses all the baselines especially in terms of relatedness and correctness.

Chinese-English Question Generation

We also conduct experiments for zero-shot Chinese-English question generation, and adopt the same evaluation procedure as for En-Zh-QG. Pipeline (Xlm) first translates the Chinese input to English, and then conducts En-En-QG with Xlm. As shown in Table 5, human evaluation results indicate that Xnlg achieves significant improvements on the three metrics.

4.3 Abstractive Summarization

We conduct experiments on cross-lingual abstractive summarization (AS). AS is the task of converting the input sentences into summaries while preserving the key meanings. For evaluation, we use English/French/Chinese Gigaword (LDC2011T07, LDC2011T10, LDC2011T13), extracting the first sentence and the headline of each article and regarding them as the input document and the target summary, respectively. For each language, we sample 500k/5k/5k examples for training/validation/test.

Zero-Shot Summarization

In the zero-shot setting, we only use English data for training, and directly evaluate the model on other languages. In Table 7 and Table 8, we present the results for French/Chinese AS, which are evaluated by the ROUGE-1, ROUGE-2 and ROUGE-L metrics. We also report the results of supervised AS in Table 6 for reference. We find that Xnlg outperforms all the baseline models on both French and Chinese AS. Compared with French, there is a larger gap between the baselines and our model on zero-shot Chinese AS, which indicates that the error propagation issue is more serious for distant language pairs.

4.4 Ablation Studies

Effects of Pre-Training

| Models | BL-4 | MTR | RG-L |
|---|---|---|---|
| Xlm | 0.25 | 0.62 | 2.56 |
| Xnlg | 16.37 | 18.74 | 34.93 |
| - XAE | 13.71 | 15.88 | 31.43 |
| - DAE | 0.38 | 1.79 | 3.79 |

Table 9: Ablations for pre-training objectives, where models are evaluated on zero-shot Chinese-Chinese question generation. Same shorthands apply as in Table 1.

We conduct ablation studies for the pre-training objectives, and the results are shown in Table 9. We observe that our model greatly benefits from the DAE objective on the zero-shot Chinese question generation task. The results also demonstrate that combining DAE and XAE can alleviate the spurious correlation issue and improve cross-lingual NLG.

Effects of Fine-Tuning Strategies

| Fine-tuning | Supervised En-En-QG: BL-4 | MTR | RG-L | Zero-Shot Zh-Zh-QG: BL-4 | MTR | RG-L |
|---|---|---|---|---|---|---|
| All | 19.99 | 24.05 | 48.74 | 6.82 | 14.84 | 21.77 |
| Dec | 16.37 | 20.91 | 44.51 | 0.21 | 1.25 | 2.05 |
| Enc | 19.62 | 23.66 | 48.78 | 15.72 | 18.89 | 34.82 |
| ET | 19.69 | 23.73 | 48.53 | 16.37 | 18.74 | 34.93 |

Table 10: Effects of different fine-tuning strategies. Dec, Enc and ET represent fine-tuning the parameters of the decoder, the encoder, and the Transformer layers of the encoder, respectively. Same shorthands apply as in Table 1.

As shown in Table 10, we use the En-En-QG and Zh-Zh-QG tasks to analyze the effects of different fine-tuning strategies. When only the encoder parameters are fine-tuned, our model obtains strong performance for both English and Chinese QG, which shows the strong cross-lingual transfer ability of our model. When fine-tuning all the parameters, the model achieves the best score for English QG, but it suffers a performance drop when evaluated on Chinese QG. We find that fine-tuning the decoder hurts cross-lingual decoding, and the model learns to decode only English words. When only the decoder is fine-tuned, the performance degrades by a large margin for both languages because of underfitting, which indicates the necessity of fine-tuning the encoder.
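The four strategies compared in Table 10 amount to choosing which parameter groups receive gradient updates. A framework-free sketch of that selection is shown below. The parameter naming scheme (`encoder.`, `decoder.`, `.embeddings.`) is hypothetical, as is the assumption that the Dec strategy also covers any embeddings stored under the decoder.

```python
def is_trainable(param_name, strategy):
    """Decide whether a parameter is updated under a fine-tuning strategy.

    Assumed naming: parameters start with "encoder." or "decoder.", and
    embedding tables contain ".embeddings." in their name.
    """
    in_encoder = param_name.startswith("encoder.")
    in_decoder = param_name.startswith("decoder.")
    is_embedding = ".embeddings." in param_name
    if strategy == "All":
        return True
    if strategy == "Dec":
        return in_decoder
    if strategy == "Enc":
        return in_encoder
    if strategy == "ET":  # Transformer layers of the encoder only
        return in_encoder and not is_embedding
    raise ValueError(f"unknown strategy: {strategy}")


def trainable_parameters(param_names, strategy):
    """Filter a list of parameter names down to the ones that are updated."""
    return [name for name in param_names if is_trainable(name, strategy)]
```

In a deep learning framework the same decision would typically be applied by disabling gradients for every parameter where `is_trainable` returns `False`.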

Effects of Cross-Lingual Transfer

Figure 3: ROUGE-2 scores for few-shot French/Chinese abstractive summarization with different training data sizes.
Figure 4: Examples of generated questions by Xnlg and the baselines in four directions (En-En, En-Zh, Zh-En, and Zh-Zh). “*”: Because Xlm is not designed for cross-lingual NLG, it is hard to produce meaningful sentences for En-Zh-QG and Zh-Zh-QG.

We examine whether low-resource NLG can benefit from cross-lingual transfer. We consider English as the rich-resource language, and conduct experiments for few-shot French/Chinese AS. Specifically, we first fine-tune Xnlg on the English AS data, and then fine-tune it on the French or Chinese AS data. We compare with the monolingual supervised model, in which Xnlg is fine-tuned only on the dataset of the target language. As shown in Figure 3, the cross-lingual supervision improves performance for few-shot abstractive summarization. As the training data size becomes larger, the performance of the two models gets closer.

4.5 Case Studies

As shown in Figure 4, we present some examples generated by Xnlg and the baselines in four directions (En-En, En-Zh, Zh-En, and Zh-Zh). When decoding in an unseen language, Xlm tends to generate random output, because it is not designed for cross-lingual NLG. As for the pipeline model, we observe that it suffers from the error propagation issue, especially when both the source and target languages differ from the language of the training data. For example, when the pipeline model performs Zh-Zh-QG, keywords are translated twice, increasing the risk of mistranslation. In the second example, “atomic bomb” is mistranslated to “nuclear bomb”, resulting in low correctness. In contrast, by directly transferring English supervision signals to the other generation directions, the questions generated by Xnlg match the references better than those of the baselines.

5 Conclusion

In this paper, we propose a pre-training method for cross-lingual natural language generation (NLG) that can transfer monolingual NLG supervision signals to all pre-trained languages. With the pre-trained model, we achieve zero-shot cross-lingual NLG on several languages by only fine-tuning once. Experimental results show that our model outperforms the machine-translation-based pipeline model on several cross-lingual NLG tasks. For future work, we would like to improve our pre-training method towards the fully unsupervised setting.


Prof. Heyan Huang is the corresponding author. The work is supported by NKRD (No. 2018YFB1005100), NSFC (No. 61772076 and 61751201), NSFB (No. Z181100008918002), Major Project of Zhijiang Lab (No. 2019DH0ZX01), and Open fund of BDAlGGCNEL and CETC Big Data Research Institute Co., Ltd (No. w-2018018).