Quantifying Exposure Bias for Neural Language Generation

  • 2020-02-08 19:18:44
  • Tianxing He, Jingzhao Zhang, Zhiming Zhou, James Glass


The exposure bias problem refers to the training-generation discrepancy, caused by teacher forcing, in maximum likelihood estimation (MLE) training for auto-regressive neural network language models (LM). It has been regarded as a central problem for neural language generation (NLG) model training. Although a lot of algorithms have been proposed to avoid teacher forcing and therefore to alleviate exposure bias, there is little work showing how serious the exposure bias problem actually is. In this work, we first identify the self-recovery ability of MLE-trained LMs, which casts doubt on the seriousness of exposure bias. We then propose sequence-level (EB-bleu) and word-level (EB-C) metrics to quantify the impact of exposure bias. We conduct experiments for the LSTM/transformer model, in both real and synthetic settings. In addition to the unconditional NLG task, we also include results for a seq2seq machine translation task. Surprisingly, all our measurements indicate that removing the training-generation discrepancy only brings very little performance gain. In our analysis, we hypothesise that although there exists a mismatch between the model distribution and the data distribution, the mismatch is still in the model's "comfortable zone", and is not big enough to induce significant performance loss.



Machine Learning, ICML

1 Introduction

Language models (LMs) are a central module for natural language generation (NLG) tasks (trends-nlp) such as machine translation (wu17ganmt), dialogue response generation (dialogue17jiwei), and image captioning (coco14tsung). For decades, maximum likelihood estimation (MLE) has been the most widely used objective for LM training. However, there is a popular belief in the natural language processing (NLP) community that standard MLE training suffers from the “exposure bias” problem, which leads to performance degradation during test-time generation. The exposure bias problem (ss15bengio; seqtrainrnn16marc) refers to the following discrepancy between MLE training and test-time generation for auto-regressive language models: during training, the model is trained to predict the next word conditioned on prefix (or history) words sampled from the ground-truth data distribution; during generation, however, the model generates words conditioned on prefix sequences generated by the model itself. Hence, because it is only exposed to real-data prefixes during training, the language model could be biased to perform well only with ground-truth data prefixes. We illustrate this discrepancy in Figure 1 (we will establish notations in Section 2.1). As a result, during generation the errors would accumulate along the generated sequence, and the distribution generated by the model would be incrementally distorted. The forced exposure to ground-truth data during training is also referred to as “teacher forcing”.

Figure 1: An illustration of the training-generation discrepancy.

Given its definition, the exposure bias problem could arise in any setting where the model needs to make a sequence of decisions or generations, e.g., music/pixel/speech generation (Lamb2016ProfessorFA). In this work, we focus on the task of natural language generation, where the exposure bias problem was originally proposed (ss15bengio) and has since attracted huge research attention. In order to avoid teacher forcing, many training algorithms (ss15bengio; Lamb2016ProfessorFA; seqtrainrnn16marc; yu2016seqgan; zhu2018texygen; cot18sidi; rankgan17; leakgan17jiaxian; advtext17sai; seq2seqbeam16sam; nie2018relgan; zhan18irlgan) have been proposed as alternatives to MLE training. Most of these works utilize techniques from generative adversarial networks (GANs) (Goodfellow14gan) or reinforcement learning (RL) (sutton98rl). In this paper, we refer to these algorithms as non-MLE methods or text GANs. Despite the huge research effort devoted to alleviating exposure bias, the existence or significance of exposure bias is, surprisingly, much less studied. In particular, to the best of our knowledge, no existing work attempts to directly show the seriousness of exposure bias in an empirical or theoretical way. This work is motivated by the belief that a good solution should be built upon a testable and quantifiable problem definition. Beginning in Section 3, we first identify the “self-recovery” ability of popular LMs, which conflicts with the original claim of exposure bias. We then develop sequence-level and word-level quantification methods for exposure bias, and use them to validate its seriousness in controlled experiments.

2 Preliminaries

In this section, we establish notations and describe the datasets we use.

2.1 Notations

The task of auto-regressive language modelling is to learn the probability distribution of the $(l+1)$-th word $W_{l+1}$ in a sentence $W$ conditioned on the word history $W_{1:l} := (W_1, \dots, W_l)$. Here, we use $W_i \in V$ to denote a discrete random variable distributed across the vocabulary $V$. For simplicity, we assume all sentences are of length $L$. Given a training dataset $D$ consisting of sentences of length $L$, the standard MLE training aims to minimize the negative log-likelihood (NLL) objective below:

$$\mathcal{L}_{\text{MLE}} = \mathbb{E}_{W \sim D}\left[-\frac{1}{L}\sum_{l=0}^{L-1} \log P_M(W_{l+1} \mid W_{1:l})\right], \quad (1)$$

where $P_M(\cdot \mid W_{1:l})$ denotes the conditional distribution of $W_{l+1}$ under $P_M$ given a prefix $W_{1:l}$. The MLE objective can be easily extended to a seq2seq NLG task such as machine translation (ilya14seq; cho-al-emnlp14), which we omit here for brevity. We denote the distribution of an MLE-trained LM as $P_M$, which is the major subject of this study, and the ground-truth data distribution as $P_D$. We experiment with two popular model architectures: LSTM (lstm-hochreiter1997long; lstmlm-speech) and transformer LM (alexei18adaptive; transformerxl19zihang). Since the transformer architecture potentially makes better use of the history context, the degree to which its generation is affected by exposure bias could differ from that of the LSTM. Our quantification mainly relies on measuring the distance from the model’s generation distribution to the data distribution. Hence we define the following notations to simplify expressions. Let $\mathcal{P}$ denote the set of probability distributions on the vocabulary $V$. Let $d$ denote a distance measure between distributions (e.g. total variation distance), $d: \mathcal{P} \times \mathcal{P} \to \mathbb{R}_{\geq 0}$. We adopt two popular probability metrics: total variation distance (denoted as $d_{TV}$), and Jensen-Shannon divergence (denoted as $d_{JS}$).
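As a concrete reading of the two distances adopted above, the following minimal Python sketch computes them for distributions given as probability lists over the same vocabulary. It assumes the standard textbook definitions (total variation as half the L1 distance, Jensen-Shannon as the average KL divergence to the midpoint distribution, measured in nats); the function names are hypothetical and chosen only for illustration.

```python
import math

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-word distributions over a two-word vocabulary.
p_model = [0.9, 0.1]
p_data = [0.5, 0.5]
print(tv_distance(p_model, p_data))
print(js_divergence(p_model, p_data))
```

Both functions are symmetric in their arguments and return zero when the two distributions coincide; with natural logarithms the Jensen-Shannon divergence is bounded above by log 2.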

2.2 Datasets

In this work we run experiments in real-data settings and in a synthetic setting. The majority of the real-data experiments are conducted on the wiki-103 dataset. It has around 1.8m sentences / 101m words for training, and 4k sentences / 241k words for testing. We favour the wiki-103 dataset because it is large-scale and has long (over 30 words) paragraphs, which is useful for the measurements of exposure bias. It is also among the most popular datasets for LM benchmarking. It is also interesting to consider the conditional NLG case with seq2seq models (ilya14seq; cho-al-emnlp14). We use the IWSLT2014 German-to-English translation dataset (http://workshop2014.iwslt.org/). It has around 160k sentences / 3.7m words for training, and 6.7k sentences / 150k words for validation or testing (in English). For the synthetic setting, we use a model trained on the EMNLP-News dataset. It refers to the EMNLP 2017 WMT News Section, which has around 268k sentences / 7.5m words for training and 10k sentences / 277k words for testing. It has been widely used in text GAN literature (yu2016seqgan; cot18sidi).

3 Motivation: The Self-recovery Ability

In this section, we perform a “prefix-switching” experiment aiming to validate the seriousness of exposure bias. We find that it reveals a surprising self-recovery ability in MLE-trained LMs. Before diving into the experiment, we want to emphasize that although text GANs have been reported to perform better than MLE training, we should not simply conclude that exposure bias is indeed a serious issue, because we do not know the exact underlying reason for the performance gain. For example, despite the huge success of the batch normalization technique in deep learning, whether “internal covariate shift” (which is the motivation of batch-norm) exists in deep neural network training remains a question (SanTsi18How). In this work, we seek a direct way to validate the seriousness of the exposure bias problem.

Model Samples as Prefix Model Samples
When asked about how she thinks about the games, Flocke dislikes most of those about it, citing
instances of paranoia in her heart and trembling temper, which infuriated him.
Approximately 500 Finns became sick since early October when sleeping in their sleeping bags. On 3
October, the ”Red Guard”, which had been organized two months previously by Marius Kuusinen …
The entire key results of the arc be obtained through unifying methods to construct the prologue,
three pieces could be combined instead of need to provide a final chapter.
Data Samples as Prefix Model Samples
Most of what is known of Du Fu <unk> s life is clear and graphic descriptions, memoirs, commentaries
on storyboards, and descriptions of Canadian settlers. More than 60 biographies and …
In the early 730s, he travelled in the Jiangsu province of Asia after Ashras ibn Abdallah al-Sulami visited
Quanzhou in Bukhara, the capital of Turkmenistan and a native of the …
Since the Song dynasty, critics have called Du Fu the ”master poet”, a product of his use of Du Fu scenes
to establish the empress’s nature and to emphasize his …
Shuffled Data Samples as Prefix Model Samples
is Du of <unk> s known Fu of life what Most claimed was his tragic adaptation of John Ching’s The
Janus of Hades, translated by disgraced performer just months before …
in the he travelled, the early Jiangsu In 730s, he attended a mission on the peninsula. He soon
moved to Monkwearmouth, on the northern shores of Baffin Bay in The …
Since, the called have Song Fu critics dynasty Du Fu, who Zhang historians have included, have
not rivaled HABS’s Web site held for 253 years. In 2015, HABS-based producers Oronoco …
Random Sequences as Prefix Model Samples
…RANDOM… surface leader Game after a failed attempt to test her effectively in three fleets
falling to I-30. This went unnoticed by most ichthyologists; none understood either strict rules …
…RANDOM… faster elephant emperor decorations with Rocky Mountain state exploit by linking
all black geese to 1970s planning regulations that prohibit slaughter of snake species.
…RANDOM… hitting remained prominently from the system as she witnessed no mention
of criteria in the text. Douglas Turner noted then that Gottesfeld may have assumed …

Table 1: Samples of an MLE-trained transformer LM when fed with different types of length-10 prefixes. To save space, we omit the first 7 words of the random prefix. We observe that the model self-recovers from distorted prefixes instead of accumulating errors.

We design the “prefix-switching” experiment as follows: we feed an MLE-trained transformer LM on the wiki-103 dataset (alexei18adaptive) with four types of length-10 prefixes: the model’s own samples, test data samples, test data samples shuffled at the word level, or samples from a uniform random distribution. Then we let the model continue the generation given these prefixes and compare the quality of the samples in a qualitative manner. The prefix-switching experiment aims to validate the following claim, which immediately follows from the original claim of exposure bias: during generation, if we set the prefix distribution to be the ground-truth data distribution instead of the model’s own distribution (so that there is no discrepancy between training and generation), then the model’s generation quality should be much better. In the extreme case of shuffled or random prefixes, we expect the model to generate at least equally distorted sequences. The samples and prefixes are shown in Table 1. Contrary to our expectation, we do not observe a noticeable difference in sample quality between samples from model prefixes and samples from data prefixes. More surprisingly, the model is still able to generate relevant and fairly high-quality samples from shuffled prefixes. Even in the extreme case where random sequences are fed, the samples are still fairly reasonable. In Appendix A, we provide more samples for interested readers, including samples from an LSTM LM. Due to the recent increasing interest in solving exposure bias in the field of neural machine translation (NMT) (ebmt19wen), we repeat the above experiment in a standard NMT setting, and make similar observations. These experiments suggest that MLE-trained auto-regressive LMs have a self-recovery ability, i.e., the model is able to recover from artificially distorted history input and generate samples of reasonable quality.
This phenomenon is clearly in contradiction with the popular claim of exposure bias that the error induced by the mismatch between model and data distribution should, on the contrary, accumulate during the generation process. Motivated by these experiments, in the following sections, we turn to more rigorous methods to quantify the significance of exposure bias. Note that our quantification approaches will be independent of the training procedure and only require inference from the trained model.

4 Sequence-level Quantification

In this section, we propose a quantification approach that formalizes the comparison in the prefix-switching experiment (Table 1). We will use a sequence-level metric, in particular, the BLEU score (papineni-etal-2002-bleu), to compare the quality of the model’s samples when different types of prefixes are fed.

4.1 Approach

Since the key idea is to compare the generation quality with different types of prefixes, we denote the prefix distribution as $P_H \in \{P_M, P_D\}$ and first formalize the following generation process:

  • Given a prefix length $l$ and a prefix distribution $P_H$, we sample $W_{1:l}$ from $P_H$.

  • Conditioned on the prefix $W_{1:l}$, we sample $W_{l+1:L}$ from $P_M$, where $W_{l+j}$ is sampled from $P_M(\cdot \mid W_{1:l+j-1})$ with $j > 0$.

We denote the marginal distribution of $W_{l+1:L}$ under the above random process as $P_{M|H}^{W_{l+1:L}}$. If exposure bias is indeed a serious problem, we expect the quality of $W_{l+1:L}$ to be better when $P_D$, rather than $P_M$, is used as $P_H$. To be more specific, we expect $P_{M|D}^{W_{l+1:L}}$ to be closer to the ground-truth $P_D^{W_{l+1:L}}$ than $P_{M|M}^{W_{l+1:L}}$ is, where $P_D^{W_{l+1:L}}$ is simply the marginal distribution of $W_{l+1:L}$ in $P_D$. The remaining question is how to measure the distance between $P_{M|H}^{W_{l+1:L}}$ and $P_D^{W_{l+1:L}}$, because we only have access to samples from these two distributions. We adopt the corpus-bleu metric, which has been widely used in text GAN literature (yu2016seqgan; shortgan18massimo), to measure the quality of model-generated texts by comparing them to a set of real-data texts. Given a set of generated sentences and a large number of sentences from ground-truth data as references, corpus-bleu returns the average BLEU score (papineni-etal-2002-bleu) of every model-generated sentence against the reference set. A higher corpus-bleu score means that the model’s generation has better quality, in that it has higher n-gram-level overlap with the data distribution. With these ingredients in hand, we define the following quantification for exposure bias:

$$\text{EB-bleu}(M,l) = \frac{\text{corpus-bleu}\big(P_{M|D}^{W_{l+1:L}},\, P_D^{W_{l+1:L}}\big)}{\text{corpus-bleu}\big(P_{M|M}^{W_{l+1:L}},\, P_D^{W_{l+1:L}}\big)} \quad (2)$$

EB-bleu(M,l) reflects the relative performance gain in BLEU score when the length-l prefix is from PD instead of from PM. Assuming that exposure bias is indeed serious, we expect EB-bleu(M,l) to be significantly larger than 1, and it should become larger as the prefix length l increases.
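The two-step generation process behind EB-bleu can be sketched as follows. This is only an illustration under stated assumptions: the next-word functions `toy_data` and `toy_model` are hypothetical stand-ins for P_D and P_M over a two-word vocabulary, not the LSTM or transformer models used in the experiments, and the BLEU scoring of the resulting continuations is left out.

```python
import random

def sample_word(dist, rng):
    """Draw one word from a {word: probability} dictionary."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def sample_sequence(next_dist, length, rng, prefix=()):
    """Extend `prefix` to `length` words, drawing each word from next_dist(history)."""
    seq = list(prefix)
    while len(seq) < length:
        seq.append(sample_word(next_dist(tuple(seq)), rng))
    return seq

def continuation(model, prefix_source, l, L, rng):
    """Step 1: draw W_{1:l} from the prefix source P_H.
    Step 2: draw W_{l+1:L} from the model P_M, conditioned on that prefix."""
    prefix = sample_sequence(prefix_source, l, rng)
    full = sample_sequence(model, L, rng, prefix=prefix)
    return full[l:]

# Hypothetical stand-in for P_D: uniform over two words.
def toy_data(history):
    return {"a": 0.5, "b": 0.5}

# Hypothetical stand-in for P_M: mildly prefers repeating the previous word.
def toy_model(history):
    if history and history[-1] == "a":
        return {"a": 0.7, "b": 0.3}
    return {"a": 0.4, "b": 0.6}

rng = random.Random(0)
# Samples from P_{M|D}: data prefixes, model continuations.
from_data_prefix = [continuation(toy_model, toy_data, l=3, L=8, rng=rng) for _ in range(5)]
# Samples from P_{M|M}: model prefixes, model continuations.
from_model_prefix = [continuation(toy_model, toy_model, l=3, L=8, rng=rng) for _ in range(5)]
```

EB-bleu would then be the ratio of the corpus-bleu score of `from_data_prefix` to that of `from_model_prefix`, each scored against held-out real continuations.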

4.2 Results and Discussion

Prefix Length (l) 5 10 15 20 25 30
C-bleu(MLS|D) 39.81 ± 0.11 39.23 ± 0.13 38.42 ± 0.10 37.43 ± 0.09 36.31 ± 0.13 34.80 ± 0.12
C-bleu(MLS|MLS) 38.90 ± 0.10 38.41 ± 0.11 37.73 ± 0.12 36.82 ± 0.13 35.70 ± 0.11 34.31 ± 0.09
EB-bleu (MLS) 1.023 ± 0.003 1.023 ± 0.004 1.020 ± 0.003 1.017 ± 0.004 1.018 ± 0.005 1.016 ± 0.004
C-bleu(MTF|D) 56.81 ± 0.11 57.03 ± 0.12 56.84 ± 0.10 56.51 ± 0.13 56.32 ± 0.15 55.82 ± 0.09
C-bleu(MTF|MTF) 56.33 ± 0.10 56.22 ± 0.11 56.01 ± 0.11 55.83 ± 0.12 55.01 ± 0.12 55.12 ± 0.12
EB-bleu (MTF) 1.009 ± 0.002 1.013 ± 0.003 1.012 ± 0.002 1.013 ± 0.002 1.014 ± 0.002 1.013 ± 0.003
C-bleu(MTF|Dshuf) 55.63 ± 0.11 55.04 ± 0.12 54.30 ± 0.13 53.69 ± 0.10 53.01 ± 0.12 52.29 ± 0.09
EB-bleu (MTF|Dshuf) 1.020 ± 0.002 1.035 ± 0.002 1.044 ± 0.003 1.052 ± 0.001 1.060 ± 0.003 1.066 ± 0.003
Table 2: EB-bleu (defined in Equation 2) measurements on the wiki-103 dataset. MLS refers to the LSTM model, and MTF refers to the transformer model. C-bleu(M|H) refers to corpus-bleu($P_{M|H}^{W_{l+1:L}}$, $P_D^{W_{l+1:L}}$).
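As a quick arithmetic check on how the EB-bleu rows relate to the C-bleu rows, the snippet below divides the reported mean C-bleu values for the LSTM model. It is a sketch only: the ratio of two means need not equal the reported EB-bleu exactly, since the table entries are rounded and (we assume here) the reported ratio is averaged over runs.

```python
# Mean values copied from the LSTM rows of Table 2.
prefix_lengths = [5, 10, 15, 20, 25, 30]
c_bleu_data_prefix = [39.81, 39.23, 38.42, 37.43, 36.31, 34.80]   # C-bleu(MLS|D)
c_bleu_model_prefix = [38.90, 38.41, 37.73, 36.82, 35.70, 34.31]  # C-bleu(MLS|MLS)

# EB-bleu (Equation 2) approximated as the ratio of the two mean scores.
eb_bleu = [d / m for d, m in zip(c_bleu_data_prefix, c_bleu_model_prefix)]

for l, ratio in zip(prefix_lengths, eb_bleu):
    print(f"l={l:2d}  EB-bleu ~ {ratio:.3f}")
```

All six ratios stay between 1.01 and 1.03, i.e. a relative gain of roughly 1% to 2% from switching to data prefixes.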

To prepare an MLE-trained $P_M$, we use the code of Transformer-XL (transformerxl19zihang) to train a SOTA transformer LM on the wiki-103 dataset. The model is a 16-layer transformer-xl model with a hidden dimension of 410 and an inner dimension of 2100. Since the computation of corpus-bleu requires large amounts of unseen real-data samples as references, we use half of the wiki-103 training data (around 900k sentences and 50m words) to train the model $P_M$, and save the other half as samples from $P_D$ (used as references for corpus-bleu). More training details are provided in Appendix D. The resulting model $P_M$ has a test-set perplexity (PPL) of 27.81 (if trained on the full training data, the PPL would be 24.02). In addition, we also train a 3-layer LSTM LM with a hidden layer dimension of 600, which has a test-set PPL of 34.80. We show the EB-bleu measurements with different prefix lengths $l$ in the upper part of Table 2 (the reproducing code is provided in supplementary materials). The 3-gram BLEU score (BLEU-3) is used. We show the mean and standard deviation (as error bars) from 10 runs with different random seeds, and for each run 10k samples from the model are used to calculate corpus-bleu with 10k data samples as references. The transformer model has higher corpus-bleu scores than the LSTM LM, which is as expected. However, the EB-bleu measurements are merely around 1.01 or 1.02 for both the LSTM and the transformer model. This means that even if ground-truth data prefixes are fed to the model (removing the training-generation discrepancy), the relative gain in corpus-bleu is only around 1% or 2%. This agrees with our observation in Table 1. Moreover, the ratio does not become significantly larger as the prefix length grows. These measurements indicate that exposure bias is only a minor problem for MLE-trained LMs. We then check whether “worse” prefixes would induce a larger performance loss.
Similar to the prefix-switching experiment (Table 1), we feed the transformer model with word-level shuffled data prefix, and then compute corpus-bleu for the samples, denoted as corpus-bleu(MTF|Dshuf). Likewise, we compute the ratio between corpus-bleu(MTF|D) and corpus-bleu(MTF|Dshuf), denoted as EB-bleu(MTF|Dshuf), and report them in the lower part of Table 2. We find that the measured EB-bleu(MTF|Dshuf) is much larger than EB-bleu(MTF), which follows our intuition that when the mismatch in prefix quality is large enough, it will indeed induce significant performance loss in the model’s generation. With these observations, we put forward the following hypothesis to explain the insignificant influence of exposure bias indicated by the EB-bleu measurements:

Hypothesis 1.

The mismatch between PM and PD as prefix distributions is not large enough to induce significant performance loss in the model’s generation.

We will further validate this hypothesis in Section 5.2.

5 Word-level Quantification

There are two potential weaknesses of EB-bleu: (1) It essentially measures the quality of the marginal distribution of $W_{l+1:L}$, which does not reflect how consistent the generation is with the given prefix $W_{1:l}$. This problem also makes EB-bleu inapplicable to seq2seq tasks, which we explain in more detail in Appendix B. (2) Since the computation of EB-bleu involves the generation of a partial sentence $W_{l+1:L}$, one could argue that exposure bias also takes effect during this partial generation, whether $P_M$ or $P_D$ is used as $P_H$, leading to little difference in the sequence-level quality of generation. To avoid these shortcomings, and to get a more complete picture of exposure bias, we propose a word-level quantification method which focuses on the model’s conditional generation distribution of $W_{l+1}$ given a prefix $W_{1:l}$.

5.1 Approach

Again, let $P_H \in \{P_M, P_D\}$ denote the prefix distribution. With a given prefix length $l$, we first define the conditional generation deviation (CGD) as the expected distance, measured by metric $d$, between $P_M$ and $P_D$ conditioned on the prefix samples from $P_H$:

$$\mathrm{CGD}(M|H,l,d) = \mathbb{E}_{W_{1:l} \sim P_H}\Big[d\big(P_M(\cdot \mid W_{1:l}),\, P_D(\cdot \mid W_{1:l})\big)\Big] \quad (3)$$

A smaller CGD value suggests a higher-quality conditional word distribution. For the choice of the distance metric d, in addition to the standard dTV and dJS metrics, we introduce greedy decoding divergence (dGD) defined as:

$$d_{GD}(P,Q) = \mathbb{1}\big(\arg\max_i P_i \neq \arg\max_i Q_i\big) \quad (4)$$

where $\mathbb{1}$ is the indicator function, and $P, Q \in \mathcal{P}$. $d_{GD}$ reflects the model’s accuracy during greedy decoding (footnote 4: $d_{GD}$ qualifies as a pseudo-metric in mathematics). Similar to the case of EB-bleu, exposure bias should induce a significant gap between $\mathrm{CGD}(M|M,l,d)$ and $\mathrm{CGD}(M|D,l,d)$. We now define a new measurement for exposure bias at prefix length $l$ with metric $d$ to be:

$$\text{EB-C}(M,l,d) = \frac{\mathrm{CGD}(M|M,l,d)}{\mathrm{CGD}(M|D,l,d)} \quad (5)$$

EB-C describes the relative gain in CGD value when the prefix distribution $P_M$ is replaced by $P_D$. The measurements of EB-C can be interpreted in a similar way to EB-bleu: a large value of EB-C would indicate a serious impact of exposure bias, while a value close to 1 would suggest that exposure bias only has a minor effect. Since the computation of CGD (Equation 3) requires access to the data distribution $P_D$, in the next section we first consider experiments in a synthetic setting, where an existing model is used as $P_D$. Experiments in the real-data setting will be presented in Section 5.3.
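To make Equations 3 to 5 concrete, the sketch below estimates CGD as a sample average over prefixes and forms EB-C as the ratio of two such averages, using the greedy decoding divergence as the distance. The next-word functions and the prefix lists are hypothetical toy values chosen for illustration; in the experiments the prefixes would be sampled from P_M and P_D.

```python
def greedy_decoding_divergence(p, q):
    """d_GD (Equation 4): 1 if the most probable words differ, else 0.
    Ties are resolved in favour of the lowest index."""
    top_p = max(range(len(p)), key=lambda i: p[i])
    top_q = max(range(len(q)), key=lambda i: q[i])
    return 0.0 if top_p == top_q else 1.0

def estimate_cgd(prefixes, model_next, data_next, distance):
    """Sample average of d(P_M(.|prefix), P_D(.|prefix)) over the given prefixes."""
    total = sum(distance(model_next(w), data_next(w)) for w in prefixes)
    return total / len(prefixes)

def estimate_eb_c(model_prefixes, data_prefixes, model_next, data_next, distance):
    """EB-C (Equation 5): CGD under model prefixes divided by CGD under data prefixes."""
    cgd_mm = estimate_cgd(model_prefixes, model_next, data_next, distance)
    cgd_md = estimate_cgd(data_prefixes, model_next, data_next, distance)
    return cgd_mm / cgd_md

# Hypothetical toy setup: vocabulary {0, 1}; the next word depends on the last word only.
def data_next(prefix):
    return [0.6, 0.4] if prefix[-1] == 0 else [0.3, 0.7]

def model_next(prefix):
    # Disagrees with the data argmax after word 0, agrees after word 1.
    return [0.45, 0.55] if prefix[-1] == 0 else [0.2, 0.8]

model_prefixes = [(0,), (0,), (1,), (0,)]  # stand-in for samples from P_M
data_prefixes = [(0,), (1,), (1,), (0,)]   # stand-in for samples from P_D

# CGD(M|M) = 3/4 and CGD(M|D) = 2/4, so the ratio is 1.5.
eb_c = estimate_eb_c(model_prefixes, data_prefixes, model_next, data_next,
                     greedy_decoding_divergence)
print(eb_c)
```

Any other distance on distributions, such as total variation, can be passed in place of `greedy_decoding_divergence` without changing the estimator.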

5.2 Synthetic Experiments and Discussion

In text-GAN literature (yu2016seqgan; rankgan17), a randomly-initialized one-layer LSTM model with a hidden dimension of 32 is usually used as $P_D$ in synthetic experiments (we denote this setting as $M_{32}^{\text{rand}}$). However, the model is small-scale and does not reflect any structure existing in real-world text. To improve upon this approach, we train a standard MLE model on the EMNLP-news data, and use it as $P_D$ for our synthetic setting. The model is a one-layer LSTM LM with a hidden dimension of 512. We then train two LSTM LMs ($P_M$) with different capacities using samples from the data model, with the standard MLE objective (the reproducing code is provided in supplementary materials). One is a one-layer LSTM with a hidden dimension of 512 (denoted as LSTM-512); the other has a hidden dimension of 32 (denoted as LSTM-32). The small scale of LSTM-32 should make it difficult to fully recover the data model. We show test-set perplexity results of the trained models in Appendix E. Both the LSTM-32 and LSTM-512 models have worse perplexity than the data model, indicating that the data model is not fully recovered by the training process. Finally, EB-C is calculated using 100k samples from $P_M$ and $P_D$.

(a) LSTM-32
(b) LSTM-512
Figure 2: EB-C measurement (defined in Equation 5) for LSTM-32 and LSTM-512 models with different metrics. The average value of EB-C along prefix length is shown in the legend.

In Figure 2, EB-C measurements with different metrics d are shown. We also show the standard deviation as error bars from 5 runs with different random seeds. It is shown that the two models give similar measurements. The EB-C value has a slowly increasing trend as prefix length increases, which is expected as a consequence of exposure bias, i.e., PM should deviate farther from PD as prefix length increases. However, the average value of EB-C is only around 1.01 or 1.02, meaning that the gap between CGD(M|M,l,d) and CGD(M|D,l,d) is not large. Similar to the EB-bleu measurements, this indicates an insignificant impact from exposure bias.

Figure 3: CGD measurement (defined in Equation 3) for corrupted PM (with dTV) for the LSTM-512 synthetic experiment.

To dive deeper into the cause of the gap in CGD, and also to validate Hypothesis 1, we experiment with the prefix distribution $P_H$ being a corrupted version of $P_M$, denoted as $P_{M_{\text{corrupt}}}$. We specify a corrupt rate $c \in [0,1]$, and for each word in a sample prefix drawn from $P_M$, we replace it with a “noise” word drawn uniformly from the vocabulary with probability $c$. Consequently, a larger $c$ will cause the prefix distribution to deviate farther from the ground-truth $P_D$. In Figure 3, we show CGD measurements for corrupted prefixes with different corrupt rates. Compared to the small gap between CGD(M|M) and CGD(M|D), we observe larger gaps between CGD(M|Mcorrupt) and CGD(M|D). This again validates our hypothesis that the mismatch between the prefix distributions $P_M$ and $P_D$ is not large enough to induce a significant performance loss. In other words, $P_M$ has learned a “good enough” distribution that is able to keep the prefix in the well-behaving region during sampling.

What kind of model has a large EB-C measurement? Below, we provide a typical toy example LM with a large EB-C value. However, we argue that this model is unlikely to be a product of MLE training.
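Before turning to the toy example, the corruption step used for the corrupted-prefix experiment can be read as the following minimal sketch. The function name, the toy vocabulary, and the example prefix are hypothetical; the sketch also assumes that the noise word is drawn from the full vocabulary, so it may occasionally coincide with the word it replaces.

```python
import random

def corrupt_prefix(prefix, vocab, c, rng):
    """Replace each word independently, with probability c,
    by a word drawn uniformly from the vocabulary."""
    return [rng.choice(vocab) if rng.random() < c else w for w in prefix]

# Hypothetical vocabulary and model-sampled prefix.
vocab = ["the", "a", "cat", "dog", "sat", "ran"]
prefix = ["the", "cat", "sat"]

rng = random.Random(0)
for c in (0.0, 0.3, 1.0):
    print(c, corrupt_prefix(prefix, vocab, c, rng))
```

With c = 0 the prefix is returned unchanged, and with c = 1 every position is resampled uniformly, which corresponds to the random-prefix extreme.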

Example 1.

Suppose $L = 2$ and $V = \{A, B\}$, and the ground-truth data distribution is uniform on $\{AA, AB, BB, BA\}$. $P_M$ is crafted as follows: $P_M(W_1{=}A) = 0.9$, $P_M(W_2{=}A \mid W_1{=}A) = 0.9$, $P_M(W_2{=}A \mid W_1{=}B) = 0.5$. Note that the model behaves worse when $W_1 = A$, which has high probability during sampling.

For Example 1, we can easily get $\mathrm{CGD}(M|D,1,d_{TV}) = 0.2$ and $\mathrm{CGD}(M|M,1,d_{TV}) = 0.36$, which gives us $\text{EB-C}(M,1,d_{TV}) = 1.8$. However, this crafted model is unlikely to be an outcome of MLE training. The fact that $P_M(\cdot \mid W_1{=}B)$ is better modeled suggests that in the training data, there are more sentences beginning with $W_1 = B$ than with $W_1 = A$. So MLE training should assign more probability to $P_M(W_1{=}B)$, not the other way around. From this perspective, the claim of exposure bias seems to conflict with the MLE principle.
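The values quoted for Example 1 can be checked by hand. The derivation below assumes the standard definition of total variation distance as half the L1 distance, and uses the fact that under the uniform data distribution $P_D(W_1{=}A) = 0.5$ and $P_D(W_2{=}A \mid W_1) = 0.5$ for either value of $W_1$.

```latex
\begin{align*}
d_{TV}\big(P_M(\cdot \mid W_1{=}A),\, P_D(\cdot \mid W_1{=}A)\big)
  &= \tfrac{1}{2}\big(|0.9 - 0.5| + |0.1 - 0.5|\big) = 0.4, \\
d_{TV}\big(P_M(\cdot \mid W_1{=}B),\, P_D(\cdot \mid W_1{=}B)\big)
  &= \tfrac{1}{2}\big(|0.5 - 0.5| + |0.5 - 0.5|\big) = 0, \\
\mathrm{CGD}(M|D,1,d_{TV}) &= 0.5 \cdot 0.4 + 0.5 \cdot 0 = 0.2, \\
\mathrm{CGD}(M|M,1,d_{TV}) &= 0.9 \cdot 0.4 + 0.1 \cdot 0 = 0.36, \\
\text{EB-C}(M,1,d_{TV}) &= \frac{0.36}{0.2} = 1.8.
\end{align*}
```

The large ratio comes entirely from the prefix weights: the model places probability 0.9 on the prefix $W_1 = A$, which is exactly where its conditional distribution is poor, whereas the data places only 0.5 there.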

(a) CoT
(b) RankGAN
Figure 4: EB-C measurements (with dJS) for comparing non-MLE methods in the synthetic experiment.

Finally, we use EB-C to compare MLE and non-MLE training methods. We compare MLE against CoT (cot18sidi) and RankGAN (rankgan17) in the synthetic experiments. The results are shown in Figure 4. Note that the RankGAN experiments are conducted in the $M_{32}^{\text{rand}}$ setting, as we find it hard to do a fast implementation of RankGAN for the LSTM-512 setting (footnote 6: an MLE-trained model is used as the pre-trained model for the RankGAN generator; the MLE model has an oracle NLL of 8.67, and RankGAN’s oracle NLL is 8.55). We find that RankGAN and CoT give lower EB-C measurements than MLE, which is expected, as these methods avoid teacher forcing. For CoT, at short prefix lengths, EB-C is even less than 1. We believe the reason is that CoT tries to bias the model to behave better when fed with model samples. To the best of our knowledge, this is the first direct empirical evidence showing that non-MLE training does indeed alleviate the exposure bias problem. It also suggests that EB-C correctly reflects the significance of exposure bias. We believe the reason why EB-C is still larger than 1 is that text GANs still rely heavily on MLE pre-training. Finally, we want to note that although non-MLE algorithms avoid teacher forcing, these algorithms (using GAN or RL) are usually less stable and more difficult to tune. Given that our quantified measurements of exposure bias are insignificant, we think it should be questioned whether adopting these techniques to avoid exposure bias is a wise trade-off.

5.3 EB-C Measurements in Real-data Settings

As discussed in Section 5.1, to apply EB-C in a real-data setting, the only missing piece is access to $P_D(\cdot \mid W_{1:l})$. In this section, we design a process to efficiently estimate EB-C with real humans as $P_D$, by utilizing the Amazon Mechanical Turk (AMT) platform. Since it is clearly intractable for a turker to give us the distribution $P_D(\cdot \mid W_{1:l})$ due to the large size of the vocabulary, we focus on the greedy decoding divergence ($d_{GD}$) metric (Equation 4), which only requires the turkers to give the most probable prediction for $W_{l+1}$. In our preliminary trials, we find it is still very hard for a person to guess the next word, even with real data prefix samples. The reason is that the vocabulary is very big, and the turkers may not be familiar with the context (e.g. wikipedia). Therefore, we design the following simplification: for a given prefix $W_{1:l}$, we let the model ($P_M$) output its top-5 predictions for $W_{l+1}$, and then only ask the turkers to choose among the 5 choices (the turker can also indicate that none of them seems likely). Finally, we examine whether the turker’s choice is indeed the model’s top-1 prediction. This process is illustrated in Table 3. More details about the AMT setup are provided in Appendix F.
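The simplified annotation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the actual AMT pipeline: the function names are hypothetical, the probabilities are made-up values for a handful of candidate words, and a "none of the above" answer (represented as `None`) is assumed to count as a disagreement with the model.

```python
import random

def build_question(model_probs, rng, k=5):
    """Return the model's top-1 word and its top-k words in shuffled order."""
    ranked = sorted(model_probs, key=model_probs.get, reverse=True)
    choices = ranked[:k]
    rng.shuffle(choices)  # annotators should not see the model's ranking
    return ranked[0], choices

def d_gd_sample(model_top1, annotator_choice):
    """One d_GD sample: 0 if the annotator picked the model's top-1 word, else 1.
    annotator_choice is None when 'none of the above' was selected."""
    return 0 if annotator_choice == model_top1 else 1

# Hypothetical next-word probabilities for a single prefix.
probs = {"perform": 0.30, "promote": 0.20, "sing": 0.15,
         "announce": 0.10, "help": 0.08, "discuss": 0.05}

rng = random.Random(0)
top1, choices = build_question(probs, rng)
print(choices)
print(d_gd_sample(top1, "perform"), d_gd_sample(top1, "sing"), d_gd_sample(top1, None))
```

Averaging such samples over prefixes drawn from the model gives an estimate of CGD(M|M) under d_GD, and averaging over real-data prefixes gives CGD(M|D); their ratio is the EB-C estimate.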

Prefix: on 8 september 2009 , dhani harrison appeared as a
guest on the tonight show with conan o ’brien to
Choices: 0: perform 1: help 2: announce 3: sing
4: promote 5: [None of the above is plausible]
Prefix: contingent units were accused of being armed ,
while some local peasants took claims of having been
Choices: 0: conscripted 1: armed 2: involved 3: tricked
4: intimidated 5: [None of the above is plausible]
Table 3: An illustration for the next-word prediction process on AMT. The choices are shuffled. The first prefix sample is from real data, and the second prefix sample is from the trained model.
Len (l) CGD(M|D) CGD(M|M) EB-C (NLG)
10 0.726 ± 0.004 0.738 ± 0.008 1.016 ± 0.014
20 0.735 ± 0.004 0.752 ± 0.004 1.022 ± 0.012
30 0.731 ± 0.007 0.752 ± 0.003 1.029 ± 0.010
Len (l) CGD(M|D) CGD(M|M) EB-C (NMT)
5 0.485 ± 0.005 0.482 ± 0.004 0.993 ± 0.008
10 0.563 ± 0.005 0.562 ± 0.009 0.999 ± 0.017
15 0.530 ± 0.008 0.526 ± 0.004 0.992 ± 0.019
Table 4: EB-C$(M,l,d_{GD})$ measurements with humans as $P_D$. Upper: for the unconditional NLG task on the wiki-103 dataset. Lower: for the NMT task on the IWSLT14 dataset.

We use this process to estimate EB-C for the NLG task on the wiki-103 dataset, and for the NMT task on the IWSLT2014 German-to-English dataset. For the wiki-103 dataset, we reuse the transformer-xl model, with the same configuration as in Section 4.2. For the IWSLT2014 dataset, we first remove 10k samples from the training set to ensure that we have enough unseen data samples for the computation of EB-C. We then follow the example code from Fairseq (footnote 7: link to the Fairseq code) to train a 6-layer transformer encoder-decoder model with a hidden dimension of 512 and an inner dimension of 1024. We use beam search of width 10 as the decoding method. The resulting model has a BLEU-4 score of 35.81 on the test set.

The formulations of EB-C for NMT are very similar to the unconditional NLG case. The only modification is that the model prefix comes from decoding (beam search) instead of sampling. We defer the detailed formulations to Appendix C for brevity. For each pair of prefix length and prefix type ($P_M$ or $P_D$), we collect 3k $d_{GD}$ samples (via next-word prediction) from turkers on the AMT platform. Note that for NMT we select shorter prefix lengths because the sentences in the dataset are typically shorter.

The results are shown in Table 4. Error bars show the standard deviation over 5 evaluations. For the unconditional NLG case, the EB-C measurements are strikingly similar to the results of our synthetic experiments: removing the training-generation discrepancy gives only around 2% relative performance gain. For the NMT case, the EB-C measurements are not even larger than 1, indicating that exposure bias has only a minimal effect. We believe the reason is that in NMT the mismatch in the prefix distribution is much smaller than in the NLG case, because the source input already strongly constrains the output space.
These results further strengthen our conclusion that exposure bias is only a minor problem for MLE-based LM training.
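As a quick sanity check on Table 4, the reported EB-C values can be approximately recovered as the ratio CGD(M|M) / CGD(M|D) of the reported means (the definition in Equation 8 of Appendix C has the same form). The sketch below only redoes that arithmetic; the dictionary layout and function name are our own, and since the table averages over 5 evaluations, the ratio of the rounded means matches the reported value only up to small rounding differences.

```python
# (task, prefix length) -> (CGD(M|D), CGD(M|M), reported EB-C), copied from Table 4.
table4 = {
    ("NLG", 10): (0.726, 0.738, 1.016),
    ("NLG", 20): (0.735, 0.752, 1.022),
    ("NLG", 30): (0.731, 0.752, 1.029),
    ("NMT", 5): (0.485, 0.482, 0.993),
    ("NMT", 10): (0.563, 0.562, 0.999),
    ("NMT", 15): (0.530, 0.526, 0.992),
}

def eb_c(cgd_m_given_d: float, cgd_m_given_m: float) -> float:
    """EB-C as the ratio of model-prefix CGD to data-prefix CGD."""
    return cgd_m_given_m / cgd_m_given_d

# Ratio of the reported means for every row of the table.
ratios = {key: eb_c(d, m) for key, (d, m, _) in table4.items()}

for key, ratio in ratios.items():
    reported = table4[key][2]
    print(key, round(ratio, 4), "reported:", reported)
```

A value above 1 means the model disagrees with the human choice more often on its own prefixes than on data prefixes; a value at or below 1 means no measurable penalty from model-generated prefixes.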

6 Related Works

Despite the large number of works (non-MLE methods) devoted to alleviating exposure bias, to the best of our knowledge, its actual impact has never been properly studied or validated in previous work. In a relevant direction, several recent works attempt to carefully evaluate whether non-MLE training methods can give better NLG performance than standard MLE training. shortgan18massimo tunes a "temperature" parameter in the softmax output, and evaluates models over the whole quality-diversity spectrum. evalgan18stanislau proposes to use "Reverse Language Model score" or "Frechet InferSent Distance" to evaluate the model's generation performance. evalgan18guy proposes a method for approximating a distribution over tokens from GANs, and then evaluates models with standard LM metrics. These works arrive at a similar conclusion: the general performance of text GANs is not convincingly better than, and is sometimes even worse than, standard MLE training. Our work provides an explanation for these observations: the exposure bias problem, which is the motivation of text GANs, is not serious enough to affect the generation performance significantly. We stress that comparing generation performance with text GANs alone is not enough to tell us whether exposure bias is the central weakness of standard MLE training. It is also possible that exposure bias is indeed serious for MLE training, but that text GANs do not solve the problem well enough (most text GAN algorithms still rely on MLE for pre-training).

7 Discussion

We first discuss the fundamental question “Is MLE training really biased?”, from the perspective of objective functions. Note that the MLE objective (1) can be re-written as:

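$$
\begin{aligned}
&\arg\min_{\theta}\ \mathbb{E}_{W \sim P_D}\left[-\frac{1}{L}\sum_{l=0}^{L-1}\log P_M(W_{l+1} \mid W_{1:l})\right] \\
=\ &\arg\min_{\theta}\ \mathbb{E}_{W \sim P_D}\left[-\log P_M(W)\right] \\
=\ &\arg\min_{\theta}\ \mathbb{E}_{W \sim P_D}\left[\log\frac{P_D(W)}{P_M(W)}\right] \\
=\ &\arg\min_{\theta}\ D_{KL}(P_D \,\|\, P_M)
\end{aligned}
\tag{6}
$$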

where $D_{KL}$ denotes the Kullback-Leibler divergence, and $\theta$ denotes the trainable parameters in $P_M$. Therefore, MLE training minimizes the divergence of $P_M$, which is exactly the model's sampling distribution, from $P_D$. While it is true that the training is "exposed" to data samples as prefixes, we cannot simply deduce that the objective is "biased". We want to end our discussion with two remarks. First, the proposed quantification approaches should not be used as the only performance metric for NLG. For example, a position-aware uni-gram LM, which generates words independently of the previous context, has no exposure bias problem and can pass our test easily. Second, the intention of this work is not to discourage researchers from exploring non-MLE training algorithms for LM. It is entirely possible that a training objective different from $D_{KL}(P_D \,\|\, P_M)$, such as $D_{JS}(P_D \,\|\, P_M)$, can lead to better generation performance (cot18sidi; sscritique18ferenc).
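The step in Equation 6 from the negative log-likelihood to the KL divergence relies on the fact that the two differ only by the entropy of $P_D$, which does not depend on $\theta$. A minimal numerical check on a toy three-word distribution is sketched below; the distributions and candidate models are made up purely for illustration.

```python
import math

# Toy "data" distribution over three words (hypothetical values).
p_data = {"a": 0.5, "b": 0.3, "c": 0.2}

def cross_entropy(p, q):
    """Expected negative log-likelihood of q under samples from p."""
    return -sum(p[w] * math.log(q[w]) for w in p)

def entropy(p):
    """Entropy of p; constant with respect to the model."""
    return -sum(p[w] * math.log(p[w]) for w in p)

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q)."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical candidate models standing in for different values of theta.
candidates = {
    "uniform": {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
    "close": {"a": 0.45, "b": 0.35, "c": 0.20},
    "exact": dict(p_data),
}

# The gap between the two objectives is the same for every candidate.
gaps = {name: cross_entropy(p_data, q) - kl(p_data, q) for name, q in candidates.items()}

best_by_nll = min(candidates, key=lambda n: cross_entropy(p_data, candidates[n]))
best_by_kl = min(candidates, key=lambda n: kl(p_data, candidates[n]))
print(best_by_nll, best_by_kl)
```

Because the gap is the constant entropy of the data distribution, both objectives rank the candidates identically and select the same minimizer.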

8 Conclusion

In this work, we aim to check whether exposure bias is indeed a serious problem for MLE-based auto-regressive LM training. We first identify the self-recovery ability of MLE-trained LMs, which casts doubt on the seriousness of exposure bias. We then explore two intuitive approaches to quantify the significance of exposure bias, one at the sequence level (EB-bleu) and one at the word level (EB-C). All our measurements indicate that removing the training-generation discrepancy brings only very little performance gain. By analyzing the measurements with artificially perturbed prefix samples, we hypothesise that although the mismatch between the data and model prefix distributions exists, it is still in the model's "comfortable zone", and does not induce significant performance loss during generation. With these results, we conclude that, contrary to popular belief, exposure bias is only a minor problem in MLE-based LM training.


Appendix A The Self-recovery Ability in NLG and NMT

In this section we provide more samples for the prefix-switching experiment. In Table 8, we provide more samples of the MLE-trained transformer LM on the wiki-103 dataset (discussed in Section 3), when fed with different kinds of prefix. We also repeat the experiment for a standard LSTM LM trained on the EMNLP-News data, and samples are shown in Table 9.

We then conduct the same experiment for a standard NMT setting. We follow the example code from Fairseq (footnote 8: https://github.com/pytorch/fairseq/tree/master/examples/translation) to train a 6-layer encoder-decoder transformer model with a hidden dimension of 512 and an inner dimension of 1024 on the IWSLT14 German-to-English dataset. We feed the trained model with different types of prefixes during decoding, which represent different levels of training-generation discrepancy. The results are shown in Table 5. Note that the source input is kept intact. The observations are very similar to our LM experiment: the data prefix does not seem to help, and in the extreme case of a random prefix, the model still generates a fairly good partial translation. In Section 3 we summarize this observation as the self-recovery ability.

To interpret the UNREL3 results in Table 5, note that we should not directly compare the translation generated from an unrelated prefix to the reference translation. In fact, it is not fair to compare even part of it (e.g. the part after the length-3 prefix). Instead, we highlight the surprising fact that although the model is forced to begin with (is conditioned on) a wrong prefix, it still comes up with a reasonable translation. This is not an easy task even for human translators, yet the model does fairly well. Again, this contradicts the claim of exposure bias that an MLE-trained LM would produce an increasingly deviated sequence when initiated with an imperfect prefix. In fact, during generation the model self-recovers from the error in the prefix.

SOURCE: sobald der richter mich sah ,
REF: and as soon as i walked inside , the judge saw
me coming in .
DATA3: and as soon as the judge saw me .
NORMAL: as soon as the judge saw me .
UNREL3: what else is it that the judge saw me ?
RAND3: still take open action as the judge saw me .
SOURCE: ich fuhr also zum gericht .
REF: and i got in my car and i went to this courthouse .
DATA3: and i got to the court .
NORMAL: so i went to the court .
UNREL3: the reasons for me to go to the court .
RAND3: ge bor last year , i went to court .
SOURCE: ich bekam etwas angst vor technologie .
REF: i found myself becoming a little bit of a technophobe .
DATA3: i found myself a little scared of technology .
NORMAL: i got a little scared of technology .
UNREL3: um , my fear of technology was with me .
RAND3: kids - ds i got a little scared of technology .
SOURCE: das werde ich ihnen jetzt zeigen
REF: so i ’m going to try and show you what you really
get for 10 billion pixels .
DATA3: so i ’m going to show you this now .
NORMAL: this is what i ’m going to show you .
UNREL3: why did i show you that now ?
RAND3: told ct happening to you now .
Table 5: A standard NMT transformer model fed with different types of length-3 prefix. We did not do any cherry picking. “DATA” means the first three output tokens are forced to be from the reference. “NORMAL” means no prefix is forced during decoding. “UNREL” means the first three tokens are forced to be from another random unrelated sentence (which is wrong but grammatical). “RAND” means the first three tokens are completely random words. The given prefixes are underlined.

Appendix B Why EB-bleu should not be Used for Seq2Seq Tasks

We want to point out that it is not suitable to apply EB-bleu to models trained for seq2seq tasks. The reason is that the source input makes the output space of seq2seq tasks much more restricted than that of the unconditional NLG task. In that case, giving the model the ground-truth prefix amounts to "cheating". In Table 6, we provide examples to illustrate this point. We show two possible responses for a dialogue context and for a translation input. Both responses are valid for the task, but response 2 happens to be the reference answer, and response 1 is a decoding sample from the model. Now, if we give the model the prefix of the reference answer, it will be too easy for the model to guess the remaining words. For example, it is very easy to guess "a movie" after seeing the prefix "I watched". This would result in a large EB-bleu value (footnote 9: Since the reference is available, we can use the standard BLEU score instead of corpus-bleu). The key reason is that the source input already restricts the output space to be very small. This comparison is unfair because the model's sample is also a legitimate answer.

Dialogue Response Generation
Context: Hey, what did you do yesterday ?
Possible Response 1 (Model) : I went to school .
Possible Response 2 (Reference) : I watched a movie .
Ground-truth Prefix: I watched
Machine Translation
Source (German): wir können das nicht einfach machen .
Possible Response 1 (Model) : We can’t just do that .
Possible Response 2 (Reference) : It is impossible for us
to do that .
Ground-truth Prefix: It is
Table 6: An illustration for why EB-bleu is unsuitable for seq2seq tasks.

For a fair comparison, what we really need is the reference answer conditioned on the prefix sampled from the model. For example, in the dialogue case, we need to measure whether "to school" is a good completion for the prefix "I went", given the dialogue context. That aspect is captured by the definition of EB-C (Section 5).

Appendix C EB-C Formulations for Seq2Seq Tasks

In this section, we provide EB-C formulations for seq2seq tasks. We assume the dataset contains source-target pairs $(S, W) \sim P_D$. In the case of our NMT task, $S$ refers to a German sentence, and $W$ refers to the translated English sentence. The formulations of EB-C for NMT are very similar to the unconditional NLG case. The only difference is that the model prefix comes from decoding (beam search) instead of sampling:

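$$
\begin{aligned}
\text{CGD}_{\text{seq2seq}}(M|D, l, d) &= \mathbb{E}_{(S, W) \sim P_D}\left[d\left(P_M(\cdot \mid W_{1:l}, S),\ P_D(\cdot \mid W_{1:l}, S)\right)\right] \\
\text{CGD}_{\text{seq2seq}}(M|M, l, d) &= \mathbb{E}_{S \sim P_D}\left[d\left(P_M(\cdot \mid W^{\text{dec}}_{1:l}(S), S),\ P_D(\cdot \mid W^{\text{dec}}_{1:l}(S), S)\right)\right]
\end{aligned}
\tag{7}
$$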

where $W^{\text{dec}}_{1:l}(S)$ refers to the decoding result (via beam search) from $P_M$ given the source $S$. Finally, the EB-C definition is the same:

$$
\text{EB-C}_{\text{seq2seq}}(M, l, d) = \frac{\text{CGD}_{\text{seq2seq}}(M|M, l, d)}{\text{CGD}_{\text{seq2seq}}(M|D, l, d)}
\tag{8}
$$
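To make the structure of Equations 7 and 8 concrete, the sketch below evaluates them on a tiny hand-made example. Everything in it is hypothetical: the sources `s1` and `s2`, the prefixes, and the next-word distributions are invented, the expectation is replaced by an average over two contexts, and $d$ is taken to be a 0/1 greedy decoding divergence (1 when the two argmax words differ), which is our assumption about its form.

```python
def d_gd(p, q):
    """Assumed greedy decoding divergence: 1 if the argmax words differ, else 0."""
    return int(max(p, key=p.get) != max(q, key=q.get))

def cgd(contexts, p_model, p_data, d=d_gd):
    """Average d(P_M(.|prefix, S), P_D(.|prefix, S)) over (S, prefix) contexts."""
    return sum(d(p_model[c], p_data[c]) for c in contexts) / len(contexts)

# Hypothetical next-word distributions keyed by (source, target prefix).
p_model = {
    ("s1", "it is"): {"impossible": 0.6, "hard": 0.4},
    ("s2", "i watched"): {"tv": 0.7, "a": 0.3},
    ("s1", "we can't"): {"just": 0.8, "do": 0.2},
    ("s2", "i went"): {"home": 0.6, "to": 0.4},
}
p_data = {
    ("s1", "it is"): {"impossible": 0.9, "hard": 0.1},
    ("s2", "i watched"): {"tv": 0.2, "a": 0.8},
    ("s1", "we can't"): {"just": 0.7, "do": 0.3},
    ("s2", "i went"): {"home": 0.3, "to": 0.7},
}

# Data prefixes come from reference targets; model prefixes come from decoding.
data_contexts = [("s1", "it is"), ("s2", "i watched")]
model_contexts = [("s1", "we can't"), ("s2", "i went")]

cgd_m_given_d = cgd(data_contexts, p_model, p_data)   # first line of Equation 7
cgd_m_given_m = cgd(model_contexts, p_model, p_data)  # second line of Equation 7
eb_c = cgd_m_given_m / cgd_m_given_d                  # Equation 8
print(cgd_m_given_d, cgd_m_given_m, eb_c)
```

Note that `p_data` must be defined on model-decoded prefixes as well, which is exactly the quantity that requires an oracle or human annotators in practice.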

Appendix D Implementation Details of Transformer-XL, CoT, and RankGAN

In our preliminary trials, we find that the generation behavior of the transformer-xl model is not good when the prefix length is short (e.g. 5). We believe the reason is that during its training (footnote 10: https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_wt103_base.sh) the model is almost always fed with a very long history context, so its behavior with a short prefix becomes undefined. To alleviate that problem, we slightly modify the implementation by randomly emptying the history context of each mini-batch with a very small probability (0.03). In this way, the model is made aware of possible short contexts. We find that this modification is very effective and causes no noticeable performance degradation. Other training configurations of transformer-xl (learning rate, batch size, etc.) are not changed.

For CoT, we use a PyTorch implementation from https://github.com/pclucas14/GansFallingShort. We use a mediator model that has twice the size of the generator. We set M-step to 4, and G-step to 1. For RankGAN, we use a TensorFlow implementation from https://github.com/desire2020/RankGAN. Note that in our CoT and RankGAN experiments, the generator model is set to be the same size as the baseline MLE model. We tune the hyper-parameters using the corpus-bleu metric, which is widely used in the text GAN literature.
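The history-emptying modification can be illustrated with a small, framework-free sketch. It is not the Transformer-XL code: the cached history is represented by a plain Python list, `maybe_reset_memory` is a hypothetical helper, and only the reset probability of 0.03 comes from the text.

```python
import random

def maybe_reset_memory(memory, reset_prob=0.03, rng=random):
    """With probability reset_prob, drop the cached history for this mini-batch.

    `memory` stands in for the cached hidden states of previous segments;
    an empty list means the model sees no history context.
    """
    if rng.random() < reset_prob:
        return []
    return memory

# Simulate many training steps to check how often the history is emptied.
rng = random.Random(0)
num_steps = 10000
resets = 0
memory = ["states from an earlier segment"]  # placeholder for cached states
for step in range(num_steps):
    memory = maybe_reset_memory(memory, reset_prob=0.03, rng=rng)
    if not memory:
        resets += 1
    # ... a forward/backward pass using `memory` would go here ...
    memory = [f"states from step {step}"]  # cache refreshed after the step

reset_rate = resets / num_steps
print(reset_rate)
```

In this simulation only a few percent of mini-batches start from an empty history, so long-context training is largely unaffected while the model still occasionally sees short contexts.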

Appendix E Perplexity of the Trained Models for the Synthetic EB-C Experiments

Table 7 shows the perplexity (PPL) of the models trained on the EMNLP-news dataset for the synthetic EB-C experiments.

Model PPL
MLE Baseline (used as PD) 55.85
LSTM-512 (MLE, synthetic) 115.3
LSTM-32 (MLE, synthetic) 156.3
CoT-512 (synthetic) 115.6
RankGAN 53.43
Table 7: PPL results for models trained on the EMNLP-news dataset.

Appendix F Details about the AMT Evaluation for EB-C

In this section we provide more details for the AMT evaluation discussed in Section 5.3. We show the human intelligence task (HIT) interface in Figures 5 and 6. Each HIT includes 10 pairs of a context and its corresponding choices. Five of them are prefix samples from real data, and the other five are from the trained model. The prefix samples are mixed, so that the turker does not know whether a prefix sample is from real data or from the model. The next-word choices are also shuffled.

Figure 5: The HIT interface for our evaluation for NLG.
Figure 6: Examples in a HIT for NMT.

We collect around 15k HITs for each prefix length configuration. The same prefix sample is not repeated across HITs. We limit each turker to at most 200 HITs. Across all prefix length configurations, there are around 300 unique turkers, and most turkers complete fewer than 40 HITs.

Data Samples as Prefix Model Samples
He had a recurring role in 2003 on two episodes of the sitcom Roseanne as James
”Bitch” Cook and guest starred in the 1999 special Richard Ayoade’s comedy Canal & …
Du Fu <unk> s compassion, for himself and for others, which arrived at Du Fu soon
after the collapse of his political system. He was and still is called ”<unk>” …
Du Fu <unk> s work is notable above all for its use of the convention people
as arbiters in decision-making.
The tenor of his work changed as he developed his falsetto; the same February Bach performed
his entire Magnificat in the Domus Aurea, a musical hall in Campo Bartolommeo, a …
Although he wrote in all poetic forms, Du Fu explains that he had no intentions of
writing poetry, and attempted to cash in on the success of his two-volume translation …
About two thirds of Du Fu <unk> s 1500 extant works survive as collections, but about
one third have been rebuilt or linked. Some miniatures, such as the Memorial by …
According to the Encyclopædia Britannica, Du Fu <unk> s writings joined a theme of opposition to
social systems on the basis that the United States lacked standards and cotton-beets w ere …
Model Samples as Prefix Model Samples
. Competing at the 2006 Commonwealth Games, McBreen scored 3 . 87 goals per game, ranking
fourth from the conference in scoring, beaten 6 – 3 by Scotland. He also …
He, along with some young Christians from Poland, Romania, and East Germany, were taught to play
dilruba. In order to achieve this, the boy recorded 40 or 50 dilruba parts …
EEF service throughout this filter was to suffer. This was approximately with the British renewal and
capture charges on Mount Cissé, which contributed a large strength taking time to fall …
The matriarchal nature of the family is tested as opposed to that of their neighbors. In-laws
explain their position by having the rear bedroom bathed in bondage to reflect cosmic …
The branch office distributed tuition to the top level schools, gaining coverage in the art of
instruction in schools which allow them to select classes exclusively on the basis of …
Shuffled Data Samples as Prefix Model Samples
of Below an one of example is Du Fu <unk> s <unk> Système <unk> Système, also
the address of the No. 1 monuments Society and Advertising identifies a mass scale …
summarises his He <unk> by that Hung concluding life, let alone die. He ends by dying
as saying ”I died on the way”. Robert Penson has selected Hung’s last words …
<unk> top ten @ - @ became track group The the sixth ”Nation We ’ re
dedicated to at ten” based on race $15 @,@ 000 pre-determined event. The show …
An to designed music accompanying group the, video display was full on bluescreen and was rendered
with Ghibli HDTVs. For the Xbox 360’s Steam control, the cloud density was increased …
well You by <unk> received Kiss contemporary music was <unk> while Sobhi Youssef of
Sputnikmusic acted as a vocal coach for relation back to the original recordings of ”If” and …
Random Sequences as Prefix Model Samples
…RANDOM… execution love Author Churches Under Sunset and Angels <unk> post the
20 @,@ 000 To 30 @,@ 000 Arc landscape-crosses around 500 Enix areas …
…RANDOM… beyond spiders annually as part of regional zoning plans, including a
pie canning pool in Mechanicsburg, some <unk> Ellisburg, and boxes of all medical …
…RANDOM… realm unknown healthy-bred Spock (released in 1991 as The Return), arrives
in Sickbay to find a team; he engages in normal conversation (the main …
…RANDOM… rough elections appointment levels as he had already secured the if
no candidate received the season ticket, a result of the September 11 attacks. …
…RANDOM… / horses Finn ’ s experience of sexual frog foraging, and
might pose a threat to sexual preference as the crop Betsimisaraka earn ”<unk>
…RANDOM… Poland 1963 medium. Basu was the visual effects supervisor on 300
visual effects shots of Gangster, Feller’s seventh appearance in a Bollywood film. Aamir …
…RANDOM… levels MD defending her city of Beaufort, East Carolina in 2004.
At the same time, she responded against the package of short-form compatible boats …
Table 8: More samples of a SOTA MLE-trained transformer LM (on the wiki-103 dataset) when fed with different kinds of prefix. To save space, we omit the first 7 words of the random prefix.

Model Samples as Prefix Model Samples
it was only a pieces that had gone up to the forest and forces the shoppers about their chronic young
i mean we didn ’ t know what i haven ’ t considered through , ” she told bbc radio
if he were the president - elect , he was known that he would run a force in business at
but these are not as tired of ” the same message that the harry actor does have been hours in
first opinion the agent have taken four seconds , or if they don ’ t only know anything , were
” the economy of the uk is low enough of people of defending where americans think that ” brexit ,
the economy grew on 1 . 6 % since the us voted , and when it turned around 200 streets
i was able to produce on my own , which is good ; now that the theatre i ’ ve
” i ’ ve not buying boys i addressed many nervous times before , as a teenager made me is
we think about one - third of the struggles we actually want to see those very well that even more
the story of a album - which made public - was still fantastic , and for the second time in
” the test comes up before tuesday and when we ’ re feeling ahead again soon , ” she posted
a year on when he was last seen in his home and he did not see him , his suffering
brady has forced the 9 - known targets to get all - of - 12 gun migration and performing communication
i asked if he himself did , i managed to show all my charges at all , it used to
Data Samples as Prefix Model Samples
what this group does is to take down various different players in the future and we play in paris we
over 1 , 600 a day have reached greece this gone in 2013 and it planned to allow civilians on
” we ’ re working through a legacy period , and i am proud of the experience of the worker
’ the first time anyone says you need help , you don ’ t have put accurate press into the
out of those who came last year , 69 per cent of women can really take the drive to avoid
he has not played for tottenham ’ s first team this season then and sits down 15 - 0 with
so you have this man who seems to represent this bad story , which he plays minutes – because he
cnn : you made that promise , but it wasn ’ t necessarily at all the features he had in
this is a part of the population that is unk lucky to have no fault today , and it would
they picked him off three times and kept him out of the game and was in the field , the
the treatment was going to cost $ 12 , 000 as a result of the request of anyone who was
but if black political power is so important , why doesn ’ t we becomes the case that either stands
local media reported the group were not looking to hurt the animals , but would never be seen to say
Random Sequences as Prefix Model Samples
…RANDOM… big winter deserve , but they just say it your things goes wrong
…RANDOM… playoff north realise at its lowest level , improving their understanding in danger
…RANDOM… vital childhood registration , not previously planned for <unk> to each and reduced
…RANDOM… treated ship find one as an actual three points contained at a time
…RANDOM… faith five crazy schools and could give them a ” sleep ” necessary
…RANDOM… domestic jason follows a 12 - year cruise line over the christmas track
…RANDOM… ownership generous tourist accounts for more than 1 per cent every month -
…RANDOM… spending raped since the file returns in january , joining groups of foreign
…RANDOM… netflix worker four centre - and said facebook text <unk> to see how
…RANDOM… race labor witnessed is great , with more to an active the <unk>
…RANDOM… treatments airlines hidden real - time out to sell on benefits to our
…RANDOM… intention short reflects showing the nature of flying in his space rather than
…RANDOM… conversation pace motion them further , but as late as they ’ ve
…RANDOM… export feb president obama agreements with president obama and her being on trump
…RANDOM… entering pocket hill and made it later in the united states and make
Table 9: Samples of an MLE-trained LSTM LM (on the EMNLP-news dataset) when fed with different kinds of prefix. To save space, we omit the first 7 words of the random prefix.