In this work, we present an empirical study of generation order for machinetranslation. Building on recent advances in insertion-based modeling, we firstintroduce a soft order-reward framework that enables us to train models tofollow arbitrary oracle generation policies. We then make use of this frameworkto explore a large variety of generation orders, including uninformed orders,location-based orders, frequency-based orders, content-based orders, andmodel-based orders. Curiously, we find that for the WMT'14 English $\to$ Germantranslation task, order does not have a substantial impact on output quality,with unintuitive orderings such as alphabetical and shortest-first matching theperformance of a standard Transformer. This demonstrates that traditionalleft-to-right generation is not strictly necessary to achieve high performance.On the other hand, results on the WMT'18 English $\to$ Chinese task tend tovary more widely, suggesting that translation for less well-aligned languagepairs may be more sensitive to generation order.
Quick Read (beta)
An Empirical Study of Generation Order for Machine Translation
In this work, we present an empirical study of generation order for machine translation. Building on recent advances in insertion-based modeling, we first introduce a soft order-reward framework that enables us to train models to follow arbitrary oracle generation policies. We then make use of this framework to explore a large variety of generation orders, including uninformed orders, location-based orders, frequency-based orders, content-based orders, and model-based orders. Curiously, we find that for the WMT’14 English German translation task, order does not have a substantial impact on output quality, with unintuitive orderings such as alphabetical and shortest-first matching the performance of a standard Transformer. This demonstrates that traditional left-to-right generation is not strictly necessary to achieve high performance. On the other hand, results on the WMT’18 English Chinese task tend to vary more widely, suggesting that translation for less well-aligned language pairs may be more sensitive to generation order.
Neural sequence models (sutskever-nips-2014; cho-emnlp-2014) have been successfully applied to a broad range of tasks in recent years. While these models typically generate their outputs using a fixed left-to-right order, there has also been some investigation into non-left-to-right and order-independent generation in pursuit of quality or speed. For example, vinyals-iclr-2015 explored the problem of predicting sets using sequence models. While this is a domain where generation order should intuitively be unimportant, they nevertheless found it to make a substantial difference in practice. ford-emnlp-2018 explored treating language modeling as a two-pass process, where words from certain classes are generated first, and the remaining words are filled in during the second pass. They found that generating function words first followed by content words second yielded the best results. Separately, gu-iclr-2018 and lee-emnlp-2018 developed non-autoregressive approaches to machine translation where the entire output can be generated in parallel in constant time. These models do away with order selection altogether but typically lag behind their autoregressive counterparts in translation quality.
More recently, a number of novel insertion-based architectures have been developed for sequence generation (gu-arxiv-2019; stern-arxiv-2019; welleck-arxiv-2019). These frameworks license a diverse set of generation orders, including uniform (welleck-arxiv-2019), random (gu-arxiv-2019), or balanced binary trees (stern-arxiv-2019). Some of them also match the quality of state-of-the-art left-to-right models (stern-arxiv-2019). In this paper, we utilize one such framework to explore an extensive collection of generation orders, evaluating them on the WMT’14 English-German and WMT’18 English-Chinese translation tasks. We find that a number of non-standard choices achieve BLEU scores comparable to those obtained with the classical approach, suggesting that left-to-right generation might not be a necessary ingredient for high-quality translation. Our contributions are as follows:
We introduce a general soft order-reward framework that can be used to teach insertion-based models to follow any specified ordering.
We perform a thorough empirical study of various orders, including: uniform, random, left-to-right, right-to-left, common-first, rare-first, shortest-first, longest-first, alphabetical, and model-adaptive.
On the WMT 2014 English German task, we show that there is surprisingly little variation in BLEU for different generation orders. We further find that many orders are able to match the performance of a standard base Transformer.
Neural sequence models have traditionally been designed with left-to-right prediction in mind. In the classical setting, output sequences are produced by repeatedly appending tokens to the rightmost end of the hypothesis until an end-of-sequence token is generated. Though high-performing across a wide range of application areas, this approach lacks the flexibility to accommodate other types of inference such as parallel generation, constrained decoding, infilling, etc. Moreover, it also leaves open the possibility that a non-left-to-right factorization of the joint distribution over output sequences could outperform the usual monotonic ordering.
To address these concerns, several recent approaches have been proposed for insertion-based sequence modeling, in which sequences are constructed by repeatedly inserting tokens at arbitrary locations in the output rather than only at the right-most position. We use one such insertion-based model, the Insertion Transformer (stern-arxiv-2019), for our empirical study. We give a brief overview of the model in this section before moving on to the details of our investigation.
2.1 Insertion Transformer
The Insertion Transformer (stern-arxiv-2019) is a sequence-to-sequence model in which the output is formed by successively inserting one or more tokens at arbitrary locations into a partial hypothesis. This type of generation is made possible through the use of a joint distribution over tokens and slots. More formally, given an input and a partial output at time , the Insertion Transformer gives the joint distribution
where is the content being selected from the vocabulary and is the insertion location.
As its name suggests, the Insertion Transformer extends the Transformer model (vaswani-nips-2017) with a few key modifications to generalize from ordinary next-token modeling to joint token-and-slot modeling. First, the Insertion Transformer removes the causal attention mask from the decoder, allowing for fully contextualized output representations to be derived after each insertion. Second, the Insertion Transformer pads the length- decoder input on both ends so that output vectors are produced. It then concatenates adjacent pairs of output vectors to obtain slot representations, which in turn inform the conditional distributions over tokens within each slot, . Lastly, it performs an additional attention step over the slot representations to obtain a location distribution , which is multiplied with the conditional content distributions to obtain the full joint distribution: . A schematic of the architecture is given in Figure 1 for reference.
We note that stern-arxiv-2019 also experimented with a number of other architectural variants, but we use the baseline version of the model described above in our experiments for simplicity.
Once the model has been trained, it can be used for greedy autoregressive sequence generation as follows. At each step of decoding, we compute the joint
to determine what content should be inserted at which location . We then apply this insertion, increasing the sequence length by one, and repeat this process until an end-of-sequence token is produced. This is the serial decoding procedure shown in the left half of Figure 2.
The model can also be used for parallel partially-autoregressive decoding. Instead of computing the joint across all locations, we instead compute the best content for each location:
We then insert the highest-scoring tokens in parallel for all slots that are not yet finished, increasing the sequence length by anywhere between one and tokens. This strategy visualized in the right half of Figure 2.
3 Soft Order-Reward Framework
Having presented our model of interest, we now describe a general soft order-reward framework that can be used to train the model to follow any oracle ordering for sequence generation. Let be an order function mapping insertion actions to real numbers, where lower values correspond to better actions, and let be the probability assigned by the model to action . From these, we construct a reward function , an oracle policy , and a per-slot loss :
Here, is the set of all valid actions. The temperature controls the sharpness of the distribution, where results in a one-hot distribution with all mass on the best-scoring action under the order function , and results in a uniform distribution over all valid actions. Intermediate values of result in distributions which are biased towards better-scoring actions but allow for other valid actions to be taken some of the time.
Having defined the target distribution, we take the slot loss for insertions within a particular slot to be the KL-divergence between the oracle distribution and the model distribution . Substituting in for the slot loss within the training framework of stern-arxiv-2019 then gives the full sequence generation loss, which we can use to train an Insertion Transformer under any oracle policy rather than just the specific one they propose. We describe a wide variety of generation orders which can be characterized by different order functions in the subsections that follow. A summary is given in Table 1.
3.1 Uninformed Orders
We evaluate two uninformed orders, uniform and random. The uniform order gives equal reward or equivalently probability mass to any valid action. Consequently, this means we give each order a uniform probability treatment. We also experiment with random order , wherein we hash each word and use the sorted hash ID as the generation order. The random order forces the model to follow a specific random path, whereas the uniform order gives equal probability mass to any order.
3.2 Location-based Orders
We explore two types of location-based orders, balanced binary tree and monotonic orders. The balanced binary tree order encourages the model to place most of its probability mass towards the middle tokens in a missing span. Consequently, this encourages the model to generate text in a balanced binary tree order. We also experiment with soft monotonic orders , or soft left-to-right and soft right-to-left, which differ slightly from the left-to-right teacher forcing traditionally used in seq2seq. First, we still maintain a uniform roll-in policy (see Section 3.6), which increases diversity during training and helps avoid label bias. Additionally, this endows the model with the ability to “look back” and insert missing tokens in the middle of the sequence during inference, as opposed to always being forced to append only at one end of the sequence. The order reward is also soft (as described by the term above), wherein we do not place all the probability mass on the next monotonic token, but merely encourage it to generate in a monotonic fashion.
3.3 Frequency-based Orders
We evaluate two frequency-based orders: rare words first via and common words first via . For these orders, we simply sort the words based on their frequencies and used their rank as the order. We note the most frequent words tend to be punctuation and stop words, such as commas, periods, and “the” in English.
3.4 Content-based Orders
We also explore content-based orders. One class of orders is based on the word length: . This encourages the model to either emit all the shortest words first or all the longest words first.
We also explore alphabetical orderings , where sorting is based on Unicode order. We note that in Unicode, uppercase letters occur before lower case letters. This biases the model to produce words which are capitalized first (or last), typically corresponding to nouns in German. Additionally, for Chinese, the characters are roughly sorted by radical and stroke count, which bears a loose relation to the complexity and frequency of the character.
3.5 Model-based Orders
The orders presented thus far are static, meaning they are independent of the model. We also explore orders which are adaptive based on the model’s posterior. We also introduce “easy” and “hard” adaptive orders induced by . The adaptive orders look at the model’s posterior to determine the oracle policy. Consequently the loss is adaptive, as when the model updates after each gradient step, the order adapts to the model’s posterior.
In the “easy“ version, we use , which is similar to a local greedy soft EM loss. We renormalize our current model’s posterior over valid actions and optimize towards that distribution. This pushes the model’s posterior to what is correct and where it has already placed probability mass. Intuitively, this reinforces the model to select what it thinks are the easiest actions first. Conversely, the “hard” variant uses which encourages the model to place probability mass on what it thinks are the hardest valid actions. This is akin to a negative feedback system whose stationary condition is the uniform distribution.
3.6 Roll-in Policy
We follow stern-arxiv-2019 and use a uniform roll-in policy when sampling partial outputs at training time in which we first select a subset size uniformly at random, then select a random subset of the output of that size. Repeated tokens are handled via greedy left or right alignment to the true output.
|Order||En De||En Zh|
|Alphabetical (A z)||93%||87%||77%||88%||82%||73%|
|Alphabetical (z A)||90%||84%||74%||85%||78%||69%|
For our experiments, we train and evaluate models for each order on two standard machine translation datasets: WMT14 En-De and WMT18 En-Zh. For WMT14 En-De, we follow the standard setup with newstest2013 as our development set and newstest2014 as our test set. For WMT18 En-Zh, we use the official preprocessed data11 1 http://data.statmt.org/wmt18/translation-task/preprocessed/zh-en/ with no additional data normalization or filtering, taking newstest2017 to be our development set and newstest2018 our test set. En-Zh evaluation is carried out using sacreBLEU22 2 BLEU+case.mixed+lang.en-zh+numrefs.1+smooth.exp+test.wmt18+tok.zh+version.1.2.12 (post-wmt-2018). In both cases, we train all models for 1M steps using sequence-level knowledge distillation (hinton-nips-2015; kim-emnlp-2016) from a base Transformer (vaswani-nips-2017). We perform a sweep over temperatures and EOS penalties (stern-arxiv-2019) on the development set, but otherwise perform no additional hyperparameter tuning, borrowing all other model and optimization settings from the base Transformer.
4.1 Ability to Learn Different Orders
By and large, we find that the Insertion Transformer is remarkably capable of learning to generate according to whichever order it was trained for. We give example decodes for three different generation orders in Figures 3, 4, and 5. In the first example, we see that the alphabetical En-De model adheres to the Unicode ordering for Latin characters (punctuation uppercase lowercase), and that the En-Zh model similarly adheres to the Unicode order for Chinese (punctuation CJK characters sorted by radical and stroke count). In the second example, the longest-first En-De model generates subwords in decreasing order of length as expected. Finally, in the third example, the common-first En-Zh model begins with common particles and punctuation before generating the main content words.
We give a quantitative measurement of the success of each model in Table 2, computing the percentage of insertions across the development set that adhered to the best-scoring action under the desired ordering. Most models exhibit similar trends, with the majority of En-De models achieving accuracies in excess of 90% when a low temperature is used, and with corresponding results in the mid-to-upper 80% range for En-Zh. Even the random order based on token hashes has accuracies exceeding 80% for both languages, demonstrating that the model has a strong capacity to adapt to any oracle policy.
|Order||En De||En Zh|
|Alphabetical (A z)||26.86||26.58||32.7||32.5|
|Alphabetical (z A)||27.22||26.37||33.1||33.0|
4.2 Test Results
Next, we measure the quality of our models by evaluating their performance on their respective test sets. The BLEU scores are reported in Table 3. The uniform loss proposed by stern-arxiv-2019 serves as a strong baseline for both language pairs, coming within 0.6 points of the original Transformer for En-De at 26.72 BLEU, and attaining a respectable score of 33.1 BLEU on En-Zh. We note that there is a slightly larger gap between the normal Transformer and the Insertion Transformer for the latter of 2.7 points, which we hypothesize is a result of the larger discrepancy between word orders in the two languages combined with the more difficult nature of the Insertion Transformer training objective.
Most of the content-based orderings (frequency-based, length-based, alphabetical) perform comparably to the uniform loss, and even the random order is not far behind. The adaptive orders perform similarly well, with easy-first attaining one of the highest scores on En-De. Curiously, in our model adaptive easy-order, we were unable to identify any strong patterns in the generation order. The model did have a slight preference towards functional words (i.e., “,” and “der”), but the preference was weak. As for location-based losses, the binary tree loss is notable in that it achieves the highest score across all losses for both languages. On the other hand, we note that while the soft left-to-right and right-to-left losses perform substantially better than the hard loss employed in the original work by stern-arxiv-2019, performance does suffer when using parallel decoding for those models, which is generally untrue of the other orderings. We believe this is due in part to exposure bias issues arising from the monotonic ordering as compared with the uniform roll-in policy that are not shared by the other losses.
4.3 Performance vs. Sentence Length
For additional analysis, we consider how well our models perform relative to one another conditional on the length of the source sentence. Sentence length can be seen as a rough proxy measurement of the difficulty of translating a sentence. This is to determine if whether some order variations are able to achieve improved BLEU scores over other models depending on the source sentence’s length. For each sentence in the En-De and En-Zh development sets, we compute their lengths and bin them into groups of size 5, up to a maximum length of 50. Within each bin, we compute sentence-level BLEU and take the mean score across all sentences. This is done for each of our model variants. Figure 6 illustrates the results of this experiment. We observe a surprisingly small model variance across all bin lengths. This suggests that sentences that are difficult to translate are difficult across all orderings, and no particular ordering appears strictly better or worse than others. One small exception to this is a performance fall-off of hard-first orderings for very long sentences across both datasets. We also observe a different distribution of BLEU scores across bin lengths for En-De and En-Zh. In particular, En-De models are approximately monotonic-decreasing in performance as source length increases, while on En-Zh performance is roughly flat across sentence length. This also highlights the importance of taking additional diverse language pairs into consideration, as translation properties on one language pair may not be observed in others.
Ultimately, given the similarity of the development scores across sentence lengths and the test scores for the various models, we come to the surprising conclusion that for single-sentence English-German translation, generation order is relatively unimportant. However, for English-Chinese, it is unclear and we leave further analysis to future work. Under the Insertion Transformer framework, it appears order also does not matter much, however there is a 2.7 BLEU gap between the results in the Insertion Transformer and our Transformer baseline.
5 Related Work
In recent work, several insertion-based frameworks have been proposed for the generation of sequences in a non-left-to-right fashion for machine translation (stern-arxiv-2019; welleck-arxiv-2019; gu-arxiv-2019). stern-arxiv-2019 introduced the Insertion Transformer and explored uniform and balanced binary tree orders. We built upon and generalized this approach in order to explore a much broader set of orders. welleck-arxiv-2019 explored insertions using a binary-tree formulation. They also explored uniform and model-based orders, but found them to lag significantly behind their left-to-right baselines. Additionally, despite using a binary-tree formulation for generation, they did not explore tree-based orders. gu-arxiv-2019 introduced a model which did not explicitly represent the output canvas arising from insertions, but rather used an implicit representation through conditioning on the insertion sequence. They also performed an exploration of different generation orders, including random, odd-even, common-first, rare-first, and a search-adaptive order. Their search-adaptive order can be seen as a global version of our local model adaptive order, where we use the local greedy posterior as the reward function, and they use the sequence level log-probability as the reward function. Curiously, in their framework, the random order fell significantly behind the left-to-right baseline, while they showed small gains in their search adaptive order. One key difference between our work and welleck-arxiv-2019 and gu-arxiv-2019 is that we use a soft order-reward framework as opposed to teacher forcing. This might explain some of the performance differences, as our framework allows for a more flexible training objective. Additionally, since we use a uniform roll-in policy, our models may have less of a label bias problem, as they are trained to be able to continue from any partial output rather than just those arising from the target policy.
In this work, we investigated a broad array of generation orders for machine translation using an insertion-based sequence generation model, the Insertion Transformer. We found that regardless of the type of strategy selected, be it location-based, frequency-based, length-based, alphabetical, model-based, or even random, the Insertion Transformer is able to learn it with high fidelity and produce high-quality output in the selected order. This is especially true for English-German single sentence translation, in which we by and large found order to not matter. This opens a wide range of possibilities for generation tasks where monotonic orderings are not the most natural choice, and we would be excited to explore some of these areas in future work.
Full development set results for En-De translation and En-Zh translation.
|Order||English German||English Chinese|
|BLEU (+EOS)||BLEU (+EOS)||BLEU (+EOS)||BLEU (+EOS)|
|Uniform||22.39 (25.58)||24.31 (24.91)||28.6 (31.8)||30.4 (31.9)|
|Binary Tree||24.49 (25.55)||25.33 (25.70)||29.3 (31.6)||31.3 (31.9)|
|24.36 (25.43)||25.43 (25.76)||29.6 (32.0)||31.4 (32.2)|
|24.59 (25.80)||25.33 (25.80)||29.1 (32.2)||31.4 (32.3)|
|Random||23.82 (24.87)||23.97 (24.20)||28.5 (30.6)||29.4 (30.2)|
|24.03 (25.46)||24.58 (24.82)||28.6 (31.1)||30.0 (31.0)|
|24.00 (25.41)||24.68 (25.07)||28.9 (31.7)||30.4 (31.6)|
|L2R (Left-Aligned)||21.19 (24.46)||21.40 (21.57)||24.5 (30.0)||25.7 (28.3)|
|21.36 (24.02)||20.84 (21.25)||24.8 (29.8)||25.2 (27.8)|
|21.78 (24.21)||20.56 (21.11)||25.8 (29.8)||24.9 (27.6)|
|L2R (Right-Aligned)||21.77 (25.00)||22.62 (23.38)||25.6 (31.6)||27.3 (30.0)|
|21.85 (25.22)||22.78 (23.67)||25.3 (31.2)||27.0 (30.1)|
|21.01 (24.88)||22.29 (23.80)||23.5 (30.9)||25.8 (30.4)|
|R2L (Left-Aligned)||23.75 (25.04)||23.15 (23.25)||27.6 (31.4)||27.8 (28.6)|
|23.72 (25.29)||22.89 (22.89)||28.0 (31.6)||28.0 (29.3)|
|24.09 (25.64)||23.61 (23.85)||28.6 (31.9)||28.3 (29.9)|
|R2L (Right-Aligned)||19.23 (23.52)||19.70 (21.02)||21.3 (31.3)||22.3 (28.3)|
|19.56 (23.27)||20.20 (21.55)||20.9 (30.5)||21.6 (28.3)|
|20.19 (23.55)||20.84 (22.22)||20.3 (30.9)||21.5 (28.7)|
|Common First||25.20 (25.43)||25.05 (25.05)||29.9 (31.2)||30.5 (30.5)|
|25.46 (25.84)||25.76 (25.81)||30.5 (32.0)||31.1 (31.3)|
|25.30 (25.76)||25.75 (25.83)||30.4 (32.2)||31.4 (31.9)|
|Rare First||22.83 (24.30)||23.19 (23.62)||27.0 (29.5)||28.7 (29.7)|
|22.75 (24.56)||23.42 (23.99)||27.9 (30.7)||29.5 (30.5)|
|23.10 (24.79)||24.00 (24.36)||28.1 (31.2)||29.8 (31.1)|
|Shortest First||24.93 (25.55)||24.94 (25.01)||27.4 (30.3)||29.1 (30.0)|
|24.95 (25.72)||25.17 (25.28)||28.0 (30.9)||29.6 (30.8)|
|25.05 (25.85)||25.26 (25.48)||28.2 (31.4)||30.3 (31.5)|
|Longest First||23.59 (25.09)||24.24 (24.56)||29.2 (31.4)||30.5 (31.2)|
|23.53 (25.07)||24.68 (25.13)||29.2 (31.5)||31.0 (31.8)|
|24.09 (25.78)||24.93 (25.37)||29.0 (31.9)||31.1 (32.1)|
|Alphabetical (A Z a z)||24.49 (25.15)||24.87 (24.91)||29.2 (31.0)||30.1 (30.6)|
|24.61 (25.19)||24.96 (25.12)||30.1 (32.0)||30.8 (31.4)|
|24.77 (25.67)||25.45 (25.71)||29.7 (32.1)||30.7 (31.8)|
|Alphabetical (z a Z A)||24.16 (25.24)||24.56 (24.73)||29.2 (31.4)||30.3 (30.8)|
|24.19 (25.45)||24.65 (25.10)||29.3 (31.9)||30.7 (31.5)|
|24.26 (25.76)||25.02 (25.40)||29.7 (32.3)||31.0 (32.0)|
|Easy First||22.58 (24.09)||22.16 (22.63)||27.5 (30.2)||28.4 (29.7)|
|23.68 (25.08)||23.66 (24.03)||28.9 (31.6)||29.3 (30.7)|
|23.87 (25.43)||24.64 (25.26)||29.1 (31.9)||30.4 (31.7)|
|Hard First||20.01 (23.46)||23.16 (23.61)||24.7 (29.7)||28.7 (30.2)|
|20.96 (24.36)||23.76 (24.56)||25.4 (30.1)||29.1 (30.7)|
|21.97 (24.90)||24.33 (24.70)||26.4 (31.1)||29.9 (31.4)|