Unsupervised Pre-training for Natural Language Generation: A Literature Review

  • 2019-11-13 16:09:59
  • Yuanxin Liu, Zheng Lin





Yuanxin Liu1,2, Zheng Lin1
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences

Recently, unsupervised pre-training has been gaining popularity in computational linguistics, thanks to its surprising success in advancing natural language understanding (NLU) and its potential to effectively exploit large-scale unlabelled corpora. However, despite this success in NLU, the power of unsupervised pre-training has been only partially exploited in natural language generation (NLG). The major obstacle stems from an idiosyncratic property of NLG: texts are usually generated based on a certain context, which may vary with the target application. As a result, it is intractable to design a universal architecture for pre-training as in NLU scenarios. Moreover, retaining the knowledge learned from pre-training when learning on the target task is also a non-trivial problem. This review summarizes recent efforts to enhance NLG systems with unsupervised pre-training, with a special focus on methods that catalyse the integration of pre-trained models into downstream tasks. These methods are classified into architecture-based and strategy-based methods, according to how they handle the above obstacle. Discussions are also provided to give further insights into the relationship between these two lines of work, some informative empirical phenomena, and some possible directions for future work.



1 Introduction

Unsupervised pre-training has sparked intense research interest in the natural language processing (NLP) community. This technology provides a promising way to exploit linguistic information from large-scale unlabelled textual data, which can serve as auxiliary prior knowledge to benefit a wide range of NLP applications. In the literature, language modeling (LM) is a prevalent pre-training task, where the target words are predicted conditioned on a given context. It is therefore intuitive to employ pre-trained LMs for natural language generation, as the pre-training objective naturally accords with the goal of NLG. However, revolutionary improvements have so far been observed only in the field of NLU.

The primary factor that impedes the progress of unsupervised pre-training in NLG is an idiosyncratic property of text generation: we do not write words from scratch, but instead based on a particular context, e.g., the source language sentences for translation, the dialog histories for response generation, and the visual scenes for image captioning, among others. In unsupervised pre-training, the task-specific context is not available, which leads to a discrepancy between pre-training and training on the target task. More precisely, the challenges posed by this discrepancy are reflected in two aspects. First, the diverse context makes it intractable to design a universal representation extractor as in the case of NLU, and the pre-trained language generators may have to modify their inner structures to deal with the task-specific context. Second, the mismatch in data distribution and objective between the two training stages might compromise the performance on the pre-training tasks during fine-tuning, which is known as the catastrophic forgetting problem Goodfellow et al. (2013).

In response to the above challenges, two lines of work have been proposed, resorting to architecture-based and strategy-based solutions, respectively. Architecture-based methods either try to induce a task-specific architecture during pre-training (task-specific methods), or aim at building a general pre-training architecture to fit all downstream tasks (task-agnostic methods). Strategy-based methods leave the pre-training stage untouched and seek to take advantage of the pre-trained models during target task learning. The approaches include fine-tuning schedules that carefully control the learning rates used for optimization, proxy tasks that leverage labeled data to help the pre-trained model better fit the target data distribution, and knowledge distillation approaches that abandon the paradigm of initialization with pre-trained parameters and instead adopt the pre-trained model as a teacher network.

The remainder of this review is organized as follows: In Section 2, we will introduce the background knowledge about unsupervised pre-training for NLU, followed by a sketch of how the pre-trained models are employed through parameter initialization for NLG in Section 3. In Section 4, we will describe the architecture-based methods, and the strategy-based methods are presented in Section 5. Section 6 provides some in-depth discussions, and Section 7 concludes this review.

2 Background: Unsupervised Pre-training for NLU

Learning fine-grained language representations is a perennial topic in natural language understanding. In retrospect, compelling evidence suggests that good representations can be learned through unsupervised pre-training.

Early work focused on word-level representations Mikolov et al. (2013); Pennington et al. (2014), which encode each word independently. For sentence-level representations, there are roughly two kinds of pre-training objectives, namely discriminative pre-training and generative pre-training. Discriminative pre-training distinguishes the context sentence(s) of a given sentence from non-context sentence(s) Devlin et al. (2019); Logeswaran and Lee (2018), with the aim of capturing inter-sentence relationships. Generative pre-training follows the language model paradigm:

$$\max_{\theta} \sum_{t=1}^{T} \log P(x_t \mid C; \theta) \tag{1}$$

where $x_t$ is the $t$-th word in the textual sequence to generate, $T$ indicates the sequence length, $\theta$ stands for the learnable parameters, and $C$ is the context information, which is defined by the pre-training objective. ELMo Peters et al. (2018) and GPT (short for Generative Pre-training) Radford et al. (2018) adopt bi-directional LSTM Hochreiter and Schmidhuber (1997) and uni-directional Transformer Vaswani et al. (2017) language models, respectively. In this case, the context is defined as the preceding words $x_{1:t-1}$ or the following words $x_{t+1:T}$. BERT Devlin et al. (2019) is trained with a novel masked language model (MLM), which is a non-autoregressive way of generation. Specifically, MLM randomly replaces a fixed proportion of tokens in each sentence with a special [MASK] token or a random token, which results in a corrupted sentence $X_{mask}$, and predicts each replaced token based on the same context $X_{mask}$. To alleviate the inconsistency with target tasks caused by the introduction of the [MASK] token, XLNet Yang et al. (2019b) introduces a permutation-based language model, which conducts autoregressive language modeling over all possible permutations of the original word sequence. This gives rise to a context $C = X_{\mathbf{z}_{1:t-1}}$, where $\mathbf{z}$ is a certain permutation of $[1, 2, \ldots, T]$, according to the definitions in Yang et al. (2019b). Dai and Le (2015) and Kiros et al. (2015) pre-trained an encoder-decoder framework to reconstruct the input sentence and the surrounding sentences, respectively, so that the encoded input sentence is included in the context $C$.
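The MLM corruption step described above can be sketched in a few lines of Python. This is a minimal illustration, not BERT's exact recipe: the function name `corrupt_for_mlm`, the 15% selection ratio and the 10% random-replacement probability are assumptions chosen for the example.

```python
import random

MASK = "[MASK]"

def corrupt_for_mlm(tokens, vocab, mask_ratio=0.15, random_token_prob=0.1, seed=0):
    """Return (corrupted_tokens, positions) for an MLM-style cloze task.

    mask_ratio and random_token_prob are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Select a fixed proportion of positions to corrupt (at least one).
    n_select = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    corrupted = list(tokens)
    for pos in positions:
        if rng.random() < random_token_prob:
            corrupted[pos] = rng.choice(vocab)  # replace with a random token
        else:
            corrupted[pos] = MASK               # replace with the mask symbol
    return corrupted, positions
```

Every selected position is then predicted from the same corrupted sequence, which is what makes the objective non-autoregressive.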

The sentence representations learned by LMs can be used to perform many NLU tasks by adding a simple linear classifier. (Footnote 1: In the rest of this review, we refer to unsupervised pre-training as the generative LMs, unless otherwise specified.) Despite being trained with a language modeling objective, the pre-trained representations have successfully pushed the state of the art on multiple benchmarks.

3 Unsupervised Pre-training and Parameter Initialization for NLG

NLG systems are usually built with an encoder-decoder framework, where the encoder reads the context information and the decoder generates the target text from the encoded vectorial representations. A direct way to utilize the pre-trained models is to initialize part of the encoder (when dealing with textual context) and/or the decoder with pre-trained parameters. For the encoder, pre-training is expected to provide better sentence representations, as we discussed in Section 2. For the decoder, the intuition is to endow the model with some rudimentary ability for text generation.

Liu and Lapata (2019) employed BERT as the encoder for abstractive text summarization, with some additional techniques to help integrate the BERT-initialized encoder with the randomly initialized decoder, which we will explicate in Section 5.1. GPT-2 Radford et al. (2019) inherited the left-to-right LM pre-training objective from GPT and extended the application to NLG, where the pre-trained LM directly serves as the language generator, with some special symbols to identify task-specific contexts. In the case of zero-shot task transfer, preliminary experiments showed that a straightforward adaptation of GPT-2 compares unfavorably with other unsupervised baselines.

Ramachandran et al. (2017) were among the first to investigate unsupervised pre-training for sequence to sequence (Seq2Seq) learning. They used pre-trained LSTM-based LMs to initialize the first layer of the encoder and the decoder, which act as representation extractors. An additional, randomly initialized LSTM layer is then added on top of the pre-trained LMs to build the Seq2Seq framework. To make use of the text generation ability of LMs, the output softmax layer of the decoder LM is also retained. Some recent endeavours Lample and Conneau (2019); Rothe et al. (2019) explored multiple combinations of GPT- and BERT-based models to initialize the encoder and the decoder, respectively. Although remarkable results are observed, the separately pre-trained LMs are still inconsistent with the Seq2Seq framework.

4 Architecture-based Methods

4.1 Inducing Task-Specific Architecture in Pre-training

Separately initializing the encoder and the decoder with LMs neglects the interaction between the two modules at the pre-training stage, which is sub-optimal. For NLG tasks that can be modeled as Seq2Seq learning, it is feasible to jointly pre-train the encoder and the decoder. Existing approaches for this purpose fall into three variants: denoising autoencoders (DAEs), conditional masked language models (CMLMs) and sequence to sequence language models (Seq2Seq LMs).

Figure 1: Overview of pre-training tasks for Seq2Seq learning. Left: Denoising autoencoder Wang et al. (2019) takes a corrupted sentence as input and reconstructs the original sentence. The positions of x1 and x2 are switched by shuffling, x3 is replaced with the token x^, and x4 is deleted. Middle: Conditional masked language model Song et al. (2019) masks several consecutive tokens in a sentence before feeding it to the encoder, and the input sentence to the decoder is constructed with the unmasked tokens in the encoder side being masked. Right: Seq2Seq language model Dong et al. (2019) is composed of a single Transformer model, which takes the concatenation of a source sentence and a target sentence as input. The special tokens inserted between sentences are omitted for conciseness. The tokens in the source sentence can attend to each other, while each token in the target sentence can only attend to the source tokens and the target tokens on its left. The MLM-like training objective is used.

4.1.1 Denoising Autoencoder

An intuitive way to conduct unsupervised Seq2Seq learning is to train an autoencoder (AE) based on the encoder-decoder framework. Unlike plain AEs, DAEs take a corrupted sentence as input and reconstruct the original sentence. The advantage is that the corrupted input forces the decoder to extract relevant information from the source side for text generation. To obtain the corrupted sentence, Wang et al. (2019) designed three noising functions: shuffle, delete and replace (the left plot of Figure 1 gives an illustration), each of which is controlled by a pre-defined probability distribution. To be more specific, each token in the raw sequence is assigned a new index based on a Gaussian distribution $N(0, \sigma)$; the delete and replace operations on a token are determined by a Bernoulli distribution $B(p)$ with a Beta distribution as prior. The three functions are applied to the raw sequences in random order.
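The three noising functions can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation of Wang et al. (2019): the shuffle is assumed to add Gaussian noise to each position and re-sort, the Bernoulli probabilities are passed in directly (the Beta prior is omitted), and all function and parameter names are hypothetical.

```python
import random

def shuffle(tokens, sigma, rng):
    # Assumption: perturb each index with Gaussian noise, then sort by the noisy index.
    keys = [i + rng.gauss(0.0, sigma) for i in range(len(tokens))]
    return [tok for _, tok in sorted(zip(keys, tokens), key=lambda kt: kt[0])]

def delete(tokens, p, rng):
    # Drop each token independently with probability p.
    return [tok for tok in tokens if rng.random() >= p]

def replace(tokens, p, rng, vocab):
    # Swap each token for a random vocabulary item with probability p.
    return [rng.choice(vocab) if rng.random() < p else tok for tok in tokens]

def add_noise(tokens, vocab, sigma=0.5, p_delete=0.1, p_replace=0.1, seed=0):
    rng = random.Random(seed)
    ops = [
        lambda t: shuffle(t, sigma, rng),
        lambda t: delete(t, p_delete, rng),
        lambda t: replace(t, p_replace, rng, vocab),
    ]
    rng.shuffle(ops)  # the three functions are applied in random order
    noisy = list(tokens)
    for op in ops:
        noisy = op(noisy)
    return noisy
```

The DAE is then trained to map `add_noise(tokens, vocab)` back to `tokens`; a small `sigma` keeps the shuffle local, so tokens rarely move far from their original position.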

4.1.2 Conditional Masked Language Model

CMLM Song et al. (2019) extends the single-model MLM proposed by Devlin et al. (2019) to the encoder-decoder setting, where the masked text sequence is read by the encoder, and the decoder reconstructs only the masked tokens, in contrast to the entire sequence in DAEs. As the middle plot of Figure 1 shows, CMLM masks consecutive tokens, and the tokens left unmasked on the encoder side are masked when being fed to the decoder. (Footnote 2: CMLM can also be implemented in other manners Submission (2020); Ghazvininejad et al. (2019), where discrete tokens in the textual sequence are masked.) Following the notation in Song et al. (2019), let us assume that the tokens with indices from $u$ to $v$ are masked from the raw sentence $X$, which results in $X^{\setminus u:v}$, and $X^{u:v}$ denotes the decoder input. Then, when predicting each masked token $x_t$ ($u \le t \le v$), the context is $X^{u:v}_{<t}$ and $X^{\setminus u:v}$. The underlying motivation, as Song et al. (2019) argued, is to force the encoder to understand the meaning of the unmasked tokens, which is achieved by the encoder-side masks, and to encourage the decoder to refer to the source information rather than the leftward target tokens, which is achieved by the decoder-side masks.
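To make the complementary masking concrete, the sketch below builds the encoder input, the decoder input and the prediction targets for one masked span. It assumes 0-indexed, inclusive span boundaries and hypothetical helper names; it only illustrates the data layout and is not the implementation of Song et al. (2019).

```python
MASK = "[MASK]"

def cmlm_example(tokens, u, v):
    """Build (encoder_input, decoder_input, targets) for a masked span [u, v]."""
    if not 0 <= u <= v < len(tokens):
        raise ValueError("span must satisfy 0 <= u <= v < len(tokens)")
    # Encoder side: the span is hidden, everything else is visible.
    encoder_input = [MASK if u <= i <= v else tok for i, tok in enumerate(tokens)]
    # Decoder side: only the span is visible, everything else is hidden.
    decoder_input = [tok if u <= i <= v else MASK for i, tok in enumerate(tokens)]
    targets = tokens[u:v + 1]
    return encoder_input, decoder_input, targets

def left_context(tokens, u, t):
    """Span tokens available on the target side when predicting position t."""
    return tokens[u:t]

tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
enc, dec, tgt = cmlm_example(tokens, u=2, v=4)
# enc -> ['x1', 'x2', '[MASK]', '[MASK]', '[MASK]', 'x6']
# dec -> ['[MASK]', '[MASK]', 'x3', 'x4', 'x5', '[MASK]']
```

Because the decoder never sees the tokens outside the span, it has to rely on the encoder output for everything except the few leftward span tokens.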

4.1.3 Sequence to Sequence Language Model

Seq2Seq LM Dong et al. (2019) performs Seq2Seq modeling using a single Transformer model, with the concatenation of the source sentence and the target sentence as input. To simulate Seq2Seq learning with encoder-decoder frameworks, the attention span of each target token is constrained to the source tokens and the leftward target tokens, which is achieved by self-attention masks (see the right plot of Figure 1). In this way, the abilities to extract language representations and to generate text are integrated into a single model. It is worth mentioning that Seq2Seq LM does not generate the target sentence auto-regressively, but instead predicts masked tokens based on the contexts controlled by self-attention masks. In other words, Seq2Seq LM still belongs to the family of MLMs. Apart from Seq2Seq LM, Dong et al. (2019) also explored uni-directional LM and bi-directional LM structures to perform the MLM-based cloze task, and incorporated the three kinds of LMs to build the final pre-training objective.
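The self-attention mask that enforces this visibility pattern can be written down directly. The sketch below is illustrative only; the function name is hypothetical, and letting a target position attend to itself is an assumption that the text above does not settle.

```python
def seq2seq_attention_mask(src_len, tgt_len):
    """mask[i][j] is True when position i may attend to position j.

    Positions 0..src_len-1 hold the source; the rest hold the target.
    """
    total = src_len + tgt_len
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < src_len:
                # Every position (source or target) sees the whole source.
                mask[i][j] = True
            elif i >= src_len and j <= i:
                # Target positions see leftward target positions (and themselves).
                mask[i][j] = True
    return mask

for row in seq2seq_attention_mask(src_len=2, tgt_len=2):
    print(["x" if allowed else "." for allowed in row])
```

The upper-right block stays empty, so source tokens never see the target, while the lower-right block is lower-triangular, which mimics a decoder inside a single Transformer.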

4.2 Encoder-Agnostic Architectures for Adaptation

Although the Seq2Seq-based pre-training methods exhibit strong performance, they are confined to text-to-text generation. In order to encompass more diverse contexts, some researchers began to investigate encoder-agnostic pre-training architectures Golovanov et al. (2019); Ziegler et al. (2019). Context Attention and Pseudo Self-Attention are two typical variants presented by Ziegler et al. (2019), which differ in the way the task-specific context is injected (see Figure 2). Context Attention takes the form of a standard Transformer decoder, with the layer that attends to the encoder outputs being randomly initialized. Pseudo Self-Attention treats the context vectors and the previous-layer decoder outputs as a single joint input, and the attended results are computed as follows:

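$$\mathrm{PSA}(C, Y) = \mathrm{softmax}\left( (Y W_q) \begin{bmatrix} C W_k^c \\ Y W_k^y \end{bmatrix}^{\top} \right) \begin{bmatrix} C W_v^c \\ Y W_v^y \end{bmatrix} \tag{2}$$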

where $C \in \mathbb{R}^{|C| \times d_c}$ and $Y \in \mathbb{R}^{|Y| \times d_y}$ are the context vectors and the representations of the target textual sequence, respectively. The linear transformation matrices $W_k^c, W_v^c \in \mathbb{R}^{d_c \times d_{model}}$ with respect to $C$ are added to project the context into the self-attention space, and $W_q, W_k^y, W_v^y \in \mathbb{R}^{d_y \times d_{model}}$ are part of the pre-trained model.
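The computation in Equation 2 can be traced with a small pure-Python sketch. It follows the equation literally, so it omits the usual scaling factor, multiple heads and the causal mask over target positions; the helper names are hypothetical and matrices are plain lists of rows.

```python
import math

def matmul(a, b):
    # (n x k) times (k x m), with matrices stored as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def pseudo_self_attention(C, Y, W_q, W_kc, W_ky, W_vc, W_vy):
    queries = matmul(Y, W_q)                     # |Y| x d_model
    keys = matmul(C, W_kc) + matmul(Y, W_ky)     # rows stacked: (|C|+|Y|) x d_model
    values = matmul(C, W_vc) + matmul(Y, W_vy)   # rows stacked: (|C|+|Y|) x d_model
    keys_t = [list(col) for col in zip(*keys)]   # transpose of the stacked keys
    weights = softmax_rows(matmul(queries, keys_t))
    return matmul(weights, values)               # |Y| x d_model
```

Only `W_kc` and `W_vc` are new parameters; the remaining projections are reused from the pre-trained self-attention layer, and the context is simply prepended to the keys and values.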

Besides the performance on target tasks, an alternative metric to gauge the quality of encoder-agnostic architectures is the degree to which the pre-trained parameters have to change in order to inject the task-specific context. Ziegler et al. (2019) compared the parameter changes of Context Attention and Pseudo Self-Attention in the feed-forward layer, and found that Pseudo Self-Attention is more robust under this evaluation.

Figure 2: Overview of two encoder-agnostic architecture variants. The solid blocks with white backgrounds are randomly initialized. Residual connection and layer normalization are omitted for simplicity. Left: Context Attention follows the Transformer decoder side attention over the encoder outputs. Right: The encoder outputs are incorporated as part of the decoder side Pseudo Self-Attention.

5 Strategy-based Methods

5.1 Fine-tuning Schedules for Adaption

When the pre-trained model is only a part of the target task system, fine-tuning requires joint learning of components initialized in different ways, which can make the training process unstable. The pre-trained model may also suffer from an aggravated catastrophic forgetting problem, as it has to coordinate with other components during fine-tuning Edunov et al. (2019); Yang et al. (2019a). From the perspective of optimization, it is unreasonable to schedule the pre-trained components and the newly introduced components with the same learning rate, considering that the former already possess some unique knowledge. A common assumption is that the pre-trained parameters should be updated at a slower learning rate and with smoother decay Liu and Lapata (2019); Yang et al. (2019a). The rationale behind this setting is that fine-tuning with more accurate gradients can prevent the pre-trained parameters from deviating too far from their original point, while the newly introduced components need to converge quickly to the target parameter space. To this end, Liu and Lapata (2019) adopted two Adam optimizers with different learning rates for the pre-trained encoder and the randomly initialized decoder. The learning rates are scheduled as in Vaswani et al. (2017) with different warm-up steps:

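$$lr_{Enc} = \tilde{lr}_{Enc} \cdot \min\left(step^{-0.5},\ step \cdot warmup_{Enc}^{-1.5}\right), \qquad lr_{Dec} = \tilde{lr}_{Dec} \cdot \min\left(step^{-0.5},\ step \cdot warmup_{Dec}^{-1.5}\right) \tag{3}$$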

where $warmup_{Enc/Dec}$ determines how quickly the learning rate changes and $\tilde{lr}_{Enc/Dec}$ sets the scale of the maximum learning rate.
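The schedule in Equation 3 is easy to evaluate numerically. The sketch below uses a hypothetical function name, and the base rates and warm-up lengths in the usage lines are illustrative values that are not taken from the text above.

```python
def warmup_lr(step, base_lr, warmup):
    """Learning rate at a given step (1-indexed) under the schedule of Equation 3."""
    if step < 1 or warmup < 1:
        raise ValueError("step and warmup must be positive")
    # Linear warm-up for step < warmup, inverse square-root decay afterwards.
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

# Illustrative settings (assumptions): a smaller base rate and a longer warm-up
# for the pre-trained encoder than for the randomly initialized decoder.
for step in (1_000, 10_000, 20_000, 100_000):
    enc = warmup_lr(step, base_lr=2e-3, warmup=20_000)
    dec = warmup_lr(step, base_lr=0.1, warmup=10_000)
    print(step, f"{enc:.2e}", f"{dec:.2e}")
```

The two branches of the `min` meet at `step == warmup`, where the rate peaks at `base_lr * warmup ** -0.5`, so a longer warm-up both delays and lowers the peak.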

5.2 Proxy Tasks for Adaption

Large-scale unlabelled data provides generic linguistic knowledge, but the target tasks have unique data distributions and objectives. An effective way to bridge this gap is to introduce proxy tasks that make moderate changes to the pre-training objectives while taking the labeled data into account Lample and Conneau (2019); Submission (2020). Translation Language Modeling (TLM) Lample and Conneau (2019) is a special generalization of MLM to the cross-lingual situation. It leverages a parallel machine translation corpus for further training of LMs that are pre-trained on monolingual corpora. Specifically, the source language sentence and the corresponding target language sentence are fed to the model in parallel, with random tokens from each language being masked to perform cloze-style prediction as in MLM. Different from monolingual MLM, TLM encourages word predictions to rely on the interdependence between the two languages, so that the sentence representations learned from separate languages can be well aligned.

For some particular NLG tasks, existing proxy tasks designed under the supervised setup can also work with unsupervised pre-training models. For instance, in neural text summarization, the combination of extractive and abstractive objectives can generate better summaries Li et al. (2018); Gehrmann et al. (2018). (Footnote 3: Extractive text summarization selects sub-sequences from the source texts, while abstractive text summarization treats the task as a Seq2Seq learning problem.) Inspired by this, Liu and Lapata (2019) introduced extractive summarization as a proxy task to fine-tune the pre-trained BERT, before adopting it as the abstractive summarization encoder. Compared with the original BERT features, the representations learned from extractive summarization contain more task-specific information, and therefore convey the meaning of source texts better.

5.3 Knowledge Distillation for Adaption

The aforementioned methods are diverse in implementation, but share the common idea of employing the pre-trained models through parameter initialization. An alternative way to exploit the pre-trained models is using the knowledge distillation technique Hinton et al. (2015). Knowledge distillation is a special form of training, where a student network learns from the supervision signals produced by a teacher network.

Taking BERT as an example, the pre-trained MLM contains global information, which can teach the autoregressive Seq2Seq models to “see from the future” Submission (2020). In practice, the probability distribution predicted by BERT is regarded as a soft label to compute the cross-entropy loss function:

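$$\mathcal{L}_{kd}^{cross}(\theta) = -\sum_{t=1}^{|Y|} \sum_{w \in \mathcal{V}} \left[ P(y_t = w \mid Y^{masked}, X; \phi) \cdot \log P(y_t = w \mid Y_{1:t-1}, X; \theta) \right] \tag{4}$$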

where $X$, $Y$ and $Y^{masked}$ are the source sequence, the raw target sequence and the masked target sequence, respectively. $\mathcal{V}$ denotes the output vocabulary. $\theta$ indicates the parameters of the student network (Seq2Seq), which are learnable, and $\phi$ indicates the BERT parameters, which are fixed. In this way, the knowledge from unsupervised pre-training can be flexibly transferred to the target tasks, dispensing with the size and architecture limitations.
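Equation 4 is an ordinary cross-entropy in which the teacher distribution replaces the one-hot label. The sketch below computes it for toy distributions; the function name, the list-of-lists layout (one distribution over the vocabulary per target position) and the `eps` clamp are assumptions made for illustration.

```python
import math

def kd_cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Soft-label cross-entropy summed over target positions and vocabulary."""
    loss = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        for p_t, p_s in zip(p_teacher, p_student):
            # Teacher probability weights the student's log-probability.
            loss -= p_t * math.log(max(p_s, eps))
    return loss

# One target position, vocabulary of size 3.
teacher = [[0.7, 0.2, 0.1]]          # fixed teacher output (parameters phi)
close_student = [[0.6, 0.3, 0.1]]    # student output (parameters theta)
far_student = [[0.1, 0.2, 0.7]]
print(kd_cross_entropy(teacher, close_student) < kd_cross_entropy(teacher, far_student))
```

Only the student distribution depends on learnable parameters, so gradients flow through `student_probs` alone and the teacher can have any size or architecture.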

The supervision can also be derived from the hidden representations Yang et al. (2019a), with a mean-squared-error (MSE) distillation loss:

$$\mathcal{L}_{kd}^{mse} = \left\| h_m^{bert} - h_n^{seq2seq} \right\|_2^2 \tag{5}$$

where $m$ and $n$ are hyper-parameters denoting the layer indices. Compared with the probability soft labels, the representation distillation method requires the Seq2Seq model to have the same hidden size as BERT, which is a stricter constraint.

Combining the knowledge distillation loss and the standard generative loss for Seq2Seq learning gives rise to the final objective to optimize:

$$\mathcal{L}(\theta) = \alpha \mathcal{L}_{kd}(\theta) + (1 - \alpha) \mathcal{L}_{seq2seq}(\theta) \tag{6}$$

where α is the weighting term that balances the contribution of the two kinds of loss functions.
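The representation-level loss and the weighted combination can be sketched together. The function names are hypothetical, and the distillation term is written as a non-negative squared distance so that minimizing it pulls the student representation towards the teacher; this sign convention is an assumption about the intended use.

```python
def squared_l2_distillation(h_teacher, h_student):
    """Squared L2 distance between two hidden vectors of equal size."""
    if len(h_teacher) != len(h_student):
        # Representation distillation needs matching hidden sizes.
        raise ValueError("hidden sizes must match")
    return sum((a - b) ** 2 for a, b in zip(h_teacher, h_student))

def combined_loss(kd_loss, seq2seq_loss, alpha):
    """Weighted sum of the distillation loss and the generative loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * kd_loss + (1.0 - alpha) * seq2seq_loss

kd = squared_l2_distillation([0.0, 0.0], [3.0, 4.0])   # 25.0
print(combined_loss(kd, seq2seq_loss=1.0, alpha=0.1))
```

Setting `alpha` to 0 recovers plain Seq2Seq training, while larger values let the teacher signal dominate.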

6 Discussions

6.1 The Relationship between Architecture- and Strategy-based Methods

We have analysed two major challenges faced by the application of unsupervised pre-training to NLG (see Section 1). On this basis, we introduced existing methodologies from the architecture and strategy perspectives. The architecture-based methods are mainly proposed in response to the first challenge. Since the architecture of the pre-trained model has a significant effect on the downstream task (when the pre-trained parameters are used for initialization), the architecture has to be designed in advance to narrow the discrepancy between pre-training and training on target tasks. This motivation has proven highly effective in the Seq2Seq framework Wang et al. (2019); Song et al. (2019); Dong et al. (2019). The strategy-based methods focus on the second challenge. They take a post-processing point of view, aiming to make the best of the pre-trained model at the target task training stage. It is noteworthy that the two challenges are not inherently independent, and the two types of methods can actually complement each other. For example, fine-tuning schedules can alleviate the negative effects caused by modifying pre-trained structures, and the catastrophic forgetting problem can also be addressed by devising a general task-agnostic architecture.

6.2 Experimental Phenomena

Existing research on unsupervised pre-training for NLG has been conducted on various tasks for different purposes. Probing into the assorted empirical results may help us discover some interesting phenomena:

  • The advantage of pre-training gradually diminishes with the increase of labeled data Ramachandran et al. (2017); Wang et al. (2019); Song et al. (2019).

  • Fixed representations yield better results than fine-tuning in some cases Edunov et al. (2019).

  • Overall, pre-training the Seq2Seq encoder outperforms pre-training the decoder Edunov et al. (2019); Wang et al. (2019); Lample and Conneau (2019); Rothe et al. (2019).

The first two phenomena attest to the catastrophic forgetting theory. Thanks to the access to large-scale unlabeled corpora, unsupervised pre-training is able to excel in zero/low-shot settings, while the pre-trained models achieve only small gains when abundant labeled data is available. This can be explained by the high quality of the dataset and the capacity of the task-specific models, which leave little room for improvement. Nonetheless, the increased supervision from labeled data can also influence the performance on pre-training tasks. When the pre-trained parameters are fixed, the learned representations are not affected by the numerous iterations of training on the target task, which makes them work better without fine-tuning.

The third phenomenon is somewhat counter-intuitive, as the generative pre-training objectives are more similar to the decoder’s function. There is no unanimous theory to explain why the encoder is the more important element to pre-train. However, this finding suggests that pre-trained LMs are more robust when acting as representation extractors, while they are more sensitive to the change of context when acting as conditional language generators.

6.3 Future Directions

The diversity of NLG applications poses challenges on the employment of unsupervised pre-training, yet it also raises more scientific questions for us to explore. In terms of the future development of this technology, we emphasize the importance of answering four questions: 1) How to introduce unsupervised pre-training into NLG tasks with cross-modal context? 2) How to design a generic pre-training algorithm to fit a wide range of NLG tasks? 3) How to reduce the computing resources required for large-scale pre-training? 4) What aspect of knowledge do the pre-trained models provide for better language generation?

NLG tasks can be defined by the context features and mapping functions. The introduction of cross-lingual textual features Lample and Conneau (2019) and task-specific Seq2Seq architectures Song et al. (2019); Wang et al. (2019); Dong et al. (2019) in the pre-training stage has successfully boosted the performance on text-to-text generation. For NLG tasks concerning multiple modalities, it is conceivable that pre-training methods could also benefit from the joint consideration of cross-modal features. For example, in the vision-and-language field, the learning of cross-modal representations has proven to be highly effective Liu et al. (2019); Lu et al. (2019), but, to the best of our knowledge, such representations cannot yet be extracted from unpaired images and texts for image-grounded text generation.

In NLU, it is possible to pre-train one model to obtain language representations once and for all. As for NLG, a task-agnostic pre-training algorithm should go beyond representation learning and consider the general ability for language generation. The notion of “encoder-agnostic adaption” Ziegler et al. (2019) makes a preliminary step in this direction, but it still remains far from matching the performance of its NLU counterparts Peters et al. (2018); Devlin et al. (2019); Radford et al. (2018); Yang et al. (2019b).

Due to the colossal scale of the pre-training corpora, including a large number of parameters is essential to achieve favorable performance. As a result, pre-training for NLG systems usually requires at least 8 GPU cards Dong et al. (2019); Song et al. (2019); Lample and Conneau (2019), and the model size also hinders real-world applications. To reduce memory consumption, existing work has resorted to knowledge distillation to transfer the knowledge from a large teacher network to a small student network Sun et al. (2019); Jiao et al. (2019), or to parameter reduction techniques that shrink the model size in a more direct way Lan et al. (2019). However, this research has been limited to NLU scenarios, and similar efforts are needed for NLG applications.

Another important branch of research on unsupervised pre-training in NLP tries to explain what kind of knowledge can be learned from pre-training. Related work has been done on both language understanding Jawahar et al. (2019); Petroni et al. (2019) and generation See et al. (2019). Specifically, See et al. (2019) analysed the characteristics of texts generated from a pre-trained GPT-2 by evaluating them over a wide spectrum of metrics. We argue that a deeper understanding of the way in which unsupervised pre-training contributes to better text generation, and of the intrinsic mechanisms of the pre-trained models, is also crucial to future work.

7 Conclusion

Unsupervised pre-training has defined the state of the art on a variety of NLP tasks. However, in the field of NLG, the diversity of context information is still impeding the application of unsupervised pre-training. The major challenges lie in designing model architectures that cater for the assorted contexts, and in retaining the general knowledge learned from pre-training. In this review, we survey recent unsupervised methods that utilize large-scale corpora for NLG purposes, with an emphasis on those aiming to facilitate the integration of pre-trained models with downstream tasks. We propose to classify them into architecture-based and strategy-based methods, followed by detailed introductions and discussions of their pros and cons. Based on the comparison of these methods and analyses of some informative experimental results from previous publications, we summarize some scientific questions that have not yet been well understood, and suggest that future work pay attention to these questions.


  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In NIPS, pp. 3079–3087. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pp. 4171–4186. Cited by: §2, §4.1.2, §6.3.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In NIPS, Cited by: Figure 1, §4.1.3, §6.1, §6.3, §6.3.
  • S. Edunov, A. Baevski, and M. Auli (2019) Pre-trained language model representations for language generation. In NAACL-HLT (1), pp. 4052–4059. Cited by: §5.1, 2nd item, 3rd item.
  • S. Gehrmann, Y. Deng, and A. M. Rush (2018) Bottom-up abstractive summarization. In EMNLP, pp. 4098–4109. Cited by: §5.2.
  • M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Mask-predict: parallel decoding of conditional masked language models. In EMNLP, Cited by: footnote 2.
  • S. Golovanov, R. Kurbanov, S. I. Nikolenko, K. Truskovskyi, A. Tselousov, and T. Wolf (2019) Large-scale transfer learning for natural language generation. In ACL (1), pp. 6053–6058. Cited by: §4.2.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. CoRR abs/1312.6211. Cited by: §1.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. Cited by: §5.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does BERT learn about the structure of language?. In ACL (1), pp. 3651–3657. Cited by: §6.3.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2019) TinyBERT: distilling BERT for natural language understanding. CoRR abs/1909.10351. Cited by: §6.3.
  • R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In NIPS, pp. 3294–3302. Cited by: §2.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. CoRR abs/1901.07291. Cited by: §3, §5.2, 3rd item, §6.3, §6.3.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: A lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942. Cited by: §6.3.
  • W. Li, X. Xiao, Y. Lyu, and Y. Wang (2018) Improving neural abstractive document summarization with explicit information selection modeling. In EMNLP, pp. 1787–1796. Cited by: §5.2.
  • F. Liu, Y. Liu, X. Ren, X. He, K. Lei, and X. Sun (2019) Aligning visual regions and textual concepts: learning fine-grained image representations for image captioning. In NIPS, Cited by: §6.3.
  • Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. CoRR abs/1908.08345. Cited by: §3, §5.1, §5.2.
  • L. Logeswaran and H. Lee (2018) An efficient framework for learning sentence representations. In ICLR (Poster), Cited by: §2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NIPS, Cited by: §6.3.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL-HLT, pp. 2227–2237. Cited by: §2, §6.3.
  • F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel (2019) Language models as knowledge bases?. CoRR abs/1909.01066. Cited by: §6.3.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding with unsupervised learning. In Technical report, OpenAI, Cited by: §2, §6.3.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §3.
  • P. Ramachandran, P. J. Liu, and Q. V. Le (2017) Unsupervised pretraining for sequence to sequence learning. In EMNLP, pp. 383–391. Cited by: §3, 1st item.
  • S. Rothe, S. Narayan, and A. Severyn (2019) Leveraging pre-trained checkpoints for sequence generation tasks. CoRR abs/1907.12461. Cited by: §3, 3rd item.
  • A. See, A. Pappu, R. Saxena, A. Yerukola, and C. D. Manning (2019) Do massively pretrained language models make better storytellers?. CoRR abs/1909.10705. Cited by: §6.3.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In ICML, Proceedings of Machine Learning Research, Vol. 97, pp. 5926–5936. Cited by: Figure 1, §4.1.2, 1st item, §6.1, §6.3, §6.3.
  • A. Submission (2020) Distilling the knowledge of BERT for text generation. In ICLR, Cited by: §5.2, §5.3, footnote 2.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for BERT model compression. CoRR abs/1908.09355. Cited by: §6.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2, §5.1.
  • L. Wang, W. Zhao, R. Jia, S. Li, and J. Liu (2019) Denoising based sequence-to-sequence pre-training for text generation. In EMNLP, Cited by: Figure 1, §4.1.1, 1st item, 3rd item, §6.1, §6.3.
  • J. Yang, M. Wang, H. Zhou, C. Zhao, Y. Yu, W. Zhang, and L. Li (2019a) Towards making the most of BERT in neural machine translation. CoRR abs/1908.05672. Cited by: §5.1, §5.3.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019b) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. Cited by: §2, §6.3.
  • Z. M. Ziegler, L. Melas-Kyriazi, S. Gehrmann, and A. M. Rush (2019) Encoder-agnostic adaptation for conditional language generation. CoRR abs/1908.06938. Cited by: §4.2, §4.2, §6.3.