Abstract
Reinforcement learning (RL) is frequently used to increase performance intext generation tasks, including machine translation (MT), notably through theuse of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN).However, little is known about what and how these methods learn in the contextof MT. We prove that one of the most common RL methods for MT does not optimizethe expected reward, as well as show that other methods take an infeasibly longtime to converge. In fact, our results suggest that RL practices in MT arelikely to improve performance only where the pretrained parameters are alreadyclose to yielding the correct translation. Our findings further suggest thatobserved gains may be due to effects unrelated to the training signal, butrather from changes in the shape of the distribution curve.
Quick Read (beta)
On the Weaknesses of Reinforcement Learning
for Neural Machine Translation
Abstract
Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN). However, little is known about what and how these methods learn in the context of MT. We prove that one of the most common RL methods for MT does not optimize the expected reward, as well as show that other methods take an infeasibly long time to converge. In fact, our results suggest that RL practices in MT are likely to improve performance only where the pretrained parameters are already close to yielding the correct translation. Our findings further suggest that observed gains may be due to effects unrelated to the training signal, but rather from changes in the shape of the distribution curve.
positioning
On the Weaknesses of Reinforcement Learning
for Neural Machine Translation
Leshem Choshen^{1} Lior Fox^{1} Zohar Aizenbud^{1} ^{1}School of Computer Science and Engineering, ^{2} Department of Cognitive Sciences The Hebrew University of Jerusalem {leshem.choshen, lior.fox, zohar.aizenbud}@mail.huji.ac.il [email protected] Omri Abend^{1,2}
1 Introduction
Reinforcement learning (RL) is an appealing path for advancement in Machine Translation (MT), as it allows training systems to optimize nondifferentiable score functions, common in MT evaluation, as well as its ability to tackle the “exposure bias” (Ranzato et al., 2015) in standard training, namely that the model is not exposed during training to incorrectly generated tokens, and is thus unlikely to recover from generating such tokens at test time. These motivations have led to much interest in RL for text generation in general and MT in particular. Various policy gradient methods have been used, notably Reinforce (Williams, 1992) and variants thereof (e.g., Ranzato et al., 2015; Edunov et al., 2018) and Minimum Risk Training (MRT; e.g., Och, 2003; Shen et al., 2016). Another popular use of RL is for training GANs (Yang et al., 2018; Tevet et al., 2018). See §2. Nevertheless, despite increasing interest and strong results, little is known about what accounts for these performance gains, and the training dynamics involved.
We present the following contributions. First, our theoretical analysis shows that commonly used approximation methods are theoretically illfounded, and may converge to parameter values that do not minimize the risk, nor are local minima thereof (§2.3).
Second, using both naturalistic experiments and carefully constructed simulations, we show that performance gains observed in the literature likely stem not from making target tokens the most probable, but from unrelated effects, such as increasing the peakiness of the output distribution (i.e., the probability mass of the most probable tokens). We do so by comparing a setting where the reward is informative, vs. one where it is constant. In §4 we discuss this peakiness effect (PkE).
Third, we show that promoting the target token to be the mode is likely to take a prohibitively long time. The only case we find, where improvements are likely, is where the target token is among the first 23 most probable tokens according to the pretrained model. These findings suggest that Reinforce (§5) and CMRT (§6) are likely to improve over the pretrained model only under the best possible conditions, i.e., where the pretrained model is “nearly” correct.
We conclude by discussing other RL practices in MT which should be avoided for practical and theoretical reasons, and briefly discuss alternative RL approaches that will allow RL to tackle a larger class of errors in pretrained models (§7).
2 RL in Machine Translation
2.1 Setting and Notation
An MT system generates tokens $y=({y}_{1},\mathrm{\dots},{y}_{n})$ from a vocabulary $V$ one token at a time. The probability of generating ${y}_{i}$ given preceding tokens $$ is given by $$, where $x$ is the source sentence and $\theta $ are the model parameters. For each generated token $y$, we denote with $$ the score, or reward, for generating $y$ given $$, $x$, and the reference sentence ${y}^{(ref)}$. For brevity, we omit parameters where they are fixed within context. For simplicity, we assume $r$ does not depend on following tokens ${y}_{>i}$.
We also assume there is exactly one valid target token, as in practice MT systems are trained against a single reference translation per sentence (Schulz et al., 2018). In practice, either a tokenlevel reward is approximated using MonteCarlo methods (e.g., Yang et al., 2018), or a sentencelevel (sparse) reward is given at the end of the episode (sentence). The latter is equivalent to a uniform tokenlevel reward.
$r$ is often either the negative loglikelihood, or based on standard MT metrics, e.g., BLEU (Papineni et al., 2002). When applying RL in MT, we seek to maximize the expected reward (denoted with $R$); i.e., to find
$${\theta}^{*}=\underset{\theta}{argmax}R(\theta )=\underset{\theta}{argmax}{\mathbb{E}}_{y\sim {P}_{\theta}}[r(y)]$$  (1) 
2.2 Reinforce
For a given source sentence, and partially generated sentence $$, Reinforce (Williams, 1992) samples $k$ tokens ($k$ is a hyperparameter) $S=({y}^{(1)},\mathrm{\dots},{y}^{(k)})$ from ${P}_{\theta}$ and updates $\theta $ according to this rule:
$$\mathrm{\Delta}\theta \propto \frac{1}{k}\sum _{i=1}^{k}r({y}_{i})\nabla \mathrm{log}({P}_{\theta}({y}_{i}))$$  (2) 
The righthand side of Eq. (2) is an unbiased estimator of the gradient of the objective function, i.e., $\mathbb{E}\left[\mathrm{\Delta}\theta \right]\propto {\nabla}_{\theta}R\left(\theta \right)$. Therefore, Reinforce is performing a form of stochastic gradient ascent on $R$, and has similar formal guarantees. From here follows that if $R$ is constant with respect to $\theta $, then the expected $\mathrm{\Delta}\theta $ prescribed by Reinforce is zero. We note that $r$ may be shifted by a constant term (called a “baseline”), without affecting the optimal value for $\theta $.
Reinforce is used by a variety of works in MT, text generation, and imagetotext tasks (Liu et al., 2016; Wu et al., 2018; Rennie et al., 2017; Shetty et al., 2017; Hendricks et al., 2016) – in isolation, or as a part of training (Ranzato et al., 2015). Lately, an especially prominent use for Reinforce is adversarial training with discrete data, where another network predicts the reward (GAN). For some recent work on RL for NMT, see (Zhang et al., 2016; Li et al., 2017; Wu et al., 2017; Yu et al., 2017; Yang et al., 2018).
2.3 Minimum Risk Training
The term Minimum Risk Training (MRT) is used ambiguously in MT to refer either to the application of Reinforce to minimizing the risk (equivalently, to maximizing the expected reward, the negative loss), or more commonly to a somewhat different estimation method, which we term Contrastive MRT (CMRT) and turn now to analyzing. CMRT was proposed by Och (2003), adapted to NMT by Shen et al. (2016), and often used since (Ayana et al., 2016; Neubig, 2016; Shen et al., 2017; Edunov et al., 2018; Makarov and Clematide, 2018; Neubig et al., 2018).
The method works as follows: at each iteration, sample $k$ tokens $S=\{{y}_{1},\mathrm{\dots},{y}_{k}\}$ from ${P}_{\theta}$, and update $\theta $ according to the gradient of
$$\stackrel{~}{R}(\theta ,S)=\sum _{i=1}^{k}{Q}_{\theta ,S}({y}_{i})r({y}_{i})={\mathbb{E}}_{y\sim Q}\left[r(y)\right]$$ 
where
$${Q}_{\theta ,S}({y}_{i})=\frac{P{({y}_{i})}^{\alpha}}{{\sum}_{{y}_{j}\in S}P{({y}_{j})}^{\alpha}}$$ 
Commonly (but not universally), deduplication is performed, so $\stackrel{~}{R}$ sums over a set of unique values (Sennrich et al., 2017). This changes little in our empirical results and theoretical analysis.
Despite the resemblance in definitions of $R$ (Eq. (1)) and $\stackrel{~}{R}$ (indeed, $\stackrel{~}{R}$ is sometimes presented as an approximation of $R$), they differ in two important aspects. First, $Q$’s support is $S$, so increasing $Q({y}_{i})$ for some ${y}_{i}$ necessarily comes at the expense of $Q(y)$ for some $y\in S$. In contrast, increasing $P({y}_{i})$, as in Reinforce, may come at the expense of $P(y)$ for any $y\in V$. Second, $\alpha $ is a smoothness parameter: the closer $\alpha $ is to 0, the closer $Q$ is to be uniform.
We show in Appendix A that despite its name, CMRT does not optimize $R$, nor does it optimize $\mathbb{E}[\stackrel{~}{R}]$. That is, it may well converge to values that are not local maxima of $R$, making it theoretically illfounded.^{1}^{1} 1 Sakaguchi et al. (2017) discuss the relation between CMRT and Reinforce, claiming that CMRT is a variant of Reinforce. Appendix A shows that CMRT does not in fact optimize the same objective. However, given that CMRT is often used in practice, the strong results it yielded and the absence of theory for explaining it, we discuss it here. Given a sample $S$, the gradient of $\stackrel{~}{R}$ is given by
$$\begin{array}{cc}\hfill \nabla \stackrel{~}{R}=& \alpha \sum _{i=1}^{k}\left(Q({y}_{i})\cdot r({y}_{i})\cdot \nabla \mathrm{log}P({y}_{i})\right)\hfill \\ & {\mathbb{E}}_{Q}[r]\nabla \mathrm{log}Z(S)\hfill \end{array}$$  (3) 
where $Z(S)={\sum}_{i}P{({y}_{i})}^{\alpha}$. See Appendix B.
Comparing Equations (2) and (3), the differences between Reinforce and CMRT are reflected again. First, $\nabla \stackrel{~}{R}$ has an additional term, proportional to $\nabla \mathrm{log}Z(S)$, which yields the contrastive effect. This contrast may improve the rate of convergence since it counters the decrease of probability mass for nonsampled tokens.
Second, for a given $S$, the relative weighting of the gradients $\nabla \mathrm{log}P({y}_{i})$ is proportional to $r({y}_{i})Q({y}_{i})$, or equivalently to $r({y}_{i})P{({y}_{i})}^{\alpha}$. CMRT with deduplication sums over distinct values in $S$ (Eq. (3)), while Reinforce sums over all values. This means that the relative weight of the unique value ${y}_{i}$ is $\frac{r({y}_{i})\{{y}_{i}\in S\}}{k}$ in Reinforce. For $\alpha =1$ the expected value of these relative weights is the same, and so for $$ (as is commonly used), more weight is given to improbable tokens, which could also have a positive effect on the convergence rate.^{2}^{2} 2 Not performing deduplication results in assigning higher relative weight to highprobability tokens, which may have an adverse effect on convergence rate. For an implementation without deduplication, see THUMT (Zhang et al., 2017). However, if $\alpha $ is too close to 0, $\nabla \stackrel{~}{R}$ vanishes. This tradeoff explains the importance of tuning $\alpha $ reported in the literature. In §6 we present simulations with CMRT, showing very similar trends as presented by Reinforce.
3 Motivating Discussion
Implementing a stochastic gradient ascent, Reinforce is guaranteed to converge to a stationary point of $R$ under broad conditions. However, not much is known about its convergence rate under the prevailing conditions in NMT.
We begin with a qualitative, motivating analysis of these questions. As work on language generation empirically showed, RNNs quickly learn to output very peaky distributions (Press et al., 2017). This tendency is advantageous for generating fluent sentences with high probability, but may also entail slower convergence rates when using RL to finetune the model, because RL methods used in text generation sample from the (pretrained) policy distribution, which means they mostly sample what the pretrained model deems to be likely. Since the pretrained model (or policy) is peaky, exploration of other potentially more rewarding tokens will be limited, hampering convergence.
Intuitively, Reinforce increases the probabilities of successful (positively rewarding) observations, weighing updates by how rewarding they were. When sampling a handful of tokens in each context (source sentence $x$ and generated prefix $$), and where the number of epochs is not large, it is unlikely that more than a few unique tokens will be sampled from $$. (In practice, $k$ is typically between 1 and 20, and the number of epochs between 1 and 100.) It is thus unlikely that anything but the initially most probable candidates will be observed. Consequently, Reinforce initially raises their probabilities, even if more rewarding tokens can be found down the list.
We thus hypothesize the peakiness of the distribution, i.e., the probability mass allocated to the most probable tokens, will increase, at least in the first phase. We call this the peakinesseffect (PkE), and show it occurs both in simulations (§4.1) and in fullscale NMT experiments (§4.2).
With more iterations, the mostrewarding tokens will be eventually sampled, and gradually gain probability mass. This discussion suggests that training will be extremely sampleinefficient. We assess the rate of convergence empirically in §5, finding this to be indeed the case.
4 The Peakiness Effect
We turn to demonstrate that the initially most probable tokens will initially gain probability mass, even if they are not the most rewarding, yielding a PkE.
Caccia et al. (2018) recently observed in the context of language modeling using GANs that performance gains similar to those GAN yield can be achieved by decreasing the temperature for the prediction softmax (i.e., making it peakier). However, they proposed no account as to what causes this effect. Our findings propose an underlying mechanism leading to this trend. We return to this point in §7. Furthermore, given their findings, it is reasonable to assume that our results are relevant for RL use in other generation tasks, whose output space too is discrete, highdimensional and concentrated.
4.1 Controlled Simulations
We experiment with a 1layer softmax model, that predicts a token $i\in V$ with probability $\frac{{e}^{{\theta}_{i}}}{{\sum}_{j}{e}^{{\theta}_{j}}}$. $\theta ={\{{\theta}_{j}\}}_{j\in V}$ are the model’s parameters.
This model simulates the top of any MT decoder that ends with a softmax layer, as essentially all NMT decoders do. To make experiments realistic, we use similar parameters as those reported in the influential Transformer NMT system (Vaswani et al., 2017). Specifically, the size of $V$ (distinct BPE tokens) is 30715, and the initial values for $\theta $ were sampled from 1000 sets of logits taken from decoding the standard newstest2013 development set, using a pretrained Transformer model. The model was pretrained on WMT2015 training data (Bojar et al., 2015). Hyperparameters are reported in Appendix C. We define one of the tokens in $V$ to be the target token and denote it with ${y}_{best}$. We experiment with two reward functions:

1.
Simulated Reward: $r(y)=2$ for $y={y}_{best}$, $r(y)=1$ if $y$ is one of the 10 initially highest scoring tokens, and $r(y)=0$ otherwise. This simulates a condition where the pretrained model is of decent but suboptimal quality. $r$ here is at the scale of popular rewards used in MT, such as GANbased rewards or BLEU (which are between 0 and 1).

2.
Constant Reward: $r$ is constantly equal to 1, for all tokens. This setting is aimed to confirm that PkE is not a result of the signal carried by the reward.
Experiments with the first setting were run 100 times, each time for 50K steps, updating $\theta $ after each step. With the second setting, it is sufficient to take a single step at a time, as the expected update after each step is zero, and so any PkE seen in a single step is only accentuated in the next. It is, therefore, more telling to run more repetitions rather than more steps per initialization. We, therefore, sample 10000 pretrained distributions, and perform a single Reinforce step.
As RL training in NMT lasts about 30 epochs before stopping, samples about 100K tokens per epoch, and as the network already predicts ${y}_{best}$ in about two thirds of the contexts,^{3}^{3} 3 Based on our NMT experiments, which we assume to be representative of the error rate of other NMT systems. we estimate the number of steps used in practice to be in the order of magnitude of 1M. For visual clarity, we present figures for 50K100K steps. However, full experiments (with 1M steps) exhibit similar trends: where Reinforce was not close to converging after 50K steps, the same was true after 1M steps.
We evaluate the peakiness of a distribution in terms of the probability of the most probable token (the mode), the total probability of the ten most probable tokens, and the entropy of the distribution (lower entropy indicates more peakiness).
Results.
The distributions become peakier in terms of all three measures: on average, the mode’s probability and the 10 most probable tokens increases, and the entropy decreases. Figure 0(a) presents the histogram of the difference in the probability of the 10 most probable tokens in the Constant Reward setting, after a single step. Figure 0(b) depicts similar statistics for the mode. The average entropy in the pretrained model is 2.9 is reduced to 2.85 after one Reinforce step.
Simulated Reward setting shows similar trends. For example, entropy decreases from 3 to about 0.001 in 100K steps. This extreme decrease suggests it is effectively a deterministic policy. PkE is achieved in a few hundred steps, usually before other effects become prominent (see Figure 2), and is stronger than for Constant Reward.
4.2 NMT Experiments
We turn to analyzing a realworld application of Reinforce to NMT. Important differences between this and the previous simulations are: (1) it is rare in NMT for Reinforce to sample from the same conditional distribution more than a handful of times, given the number of source sentences $x$ and sentence prefixes $$ (contexts); and (2) in NMT $$ shares parameters between contexts, which means that updating ${P}_{\theta}$ for one context may influence ${P}_{\theta}$ for another.
We follow the same pretraining as in §4.1. We then follow Yang et al. (2018) in defining the reward function based on the expected BLEU score. Expected BLEU is computed by sampling suffixes for the sentence, and averaging the BLEU score of the sampled sentences against the reference.
We use early stopping with a patience of 10 epochs, where each epoch consists of 5000 sentences sampled from the WMT2015 (Bojar et al., 2015) GermanEnglish training data. We use $k=1$. We retuned the learningrate, and positive baseline settings against the development set. Other hyperparameters were an exact replication of the experiments reported in (Yang et al., 2018).
Results.
Results indicate an increase in the peakiness of the conditional distributions. Our results are based on a sample of 1000 contexts from the pretrained model, and another (independent) sample from the reinforced model.
Indeed, the modes of the conditional distributions tend to increase. Figure 3 presents the distribution of the modes’ probability in the reinforced conditional distributions compared with the pretrained model, showing a shift of probability mass towards higher probabilities for the mode, following RL. Another indication of the increased peakiness is the decrease in the average entropy of ${P}_{\theta}$, which was reduced from 3.45 in the pretrained model to an average of 2.82 following RL. This more modest reduction in entropy (compared to §4.1) might also suggest that the procedure did not converge to the optimal value for $\theta $, as then we would have expected the entropy to substantially drop if not to 0 (overfit), then to the average entropy of valid next tokens (given the source and a prefix of the sentence).
5 Performance following Reinforce
We now turn to assessing under what conditions it is likely that Reinforce will lead to an improvement in the performance of an NMT system. As in the previous section, we use both controlled simulations and NMT experiments.
5.1 Controlled Simulations
We use the same model and experimental setup described in Section 4.1, this time only exploring the Simulated Reward setting, as a Constant Reward is not expected to converge to any meaningful $\theta $. Results are averaged over 100 conditional distributions sampled from the pretrained model.
Caution should be exercised when determining the learning rate (LR). Common LRs used in the NMT literature are of the scale of ${10}^{4}$. However, in our simulations, no LR smaller than 0.1 yielded any improvement in $R$. We thus set the LR to be 0.1. We note that in our simulations, a higher learning rate means faster convergence as our reward is noisefree: it is always highest for the best option. In practice, increasing the learning rate may deteriorate results, as it may cause the system to overfit to the sampled instances. Indeed, when increasing the learning rate in our NMT experiments (see below) by an order of magnitude, early stopping caused the RL procedure to stop without any parameter updates.
Figure 2 shows how ${P}_{\theta}$ changes over the first 50K steps of Reinforce (probabilities are averaged over 100 repetitions), for a case where ${y}_{best}$ was initially the second, third and fourth most probable. Although these are the easiest settings for Reinforce, and despite the high learning rate, it fails to make ${y}_{best}$ the mode of the distribution within 100K steps, unless ${y}_{best}$ was initially the second most probable. In cases where ${y}_{best}$ is initially of a lower rank than four, it is hard to see any increase in its probability, even after 1M steps.
5.2 NMT Experiments
We trained an NMT system, using the same procedure as in Section 4.2, and report BLEU scores over the news2014 test set. After training with an expected BLEU reward, we indeed see a minor improvement which is consistent between trials and pretrained models. While the pretrain BLEU score is 30.31, the reinforced one is 30.73.
Analyzing what words were influenced by the RL procedure, we begin by computing the cumulative probability of the target token ${y}_{best}$ to be ranked lower than a given rank according to the pretrained model. Results (Figure 4) show that in about half of the cases, ${y}_{best}$ is not among the top three choices of the pretrained model, and we thus expect it not to gain substantial probability following Reinforce, according to our simulations.
We next turn to compare the ranks the reinforced model assigns to the target tokens, and their ranks according to the pretrained model. Figure 5 presents the difference in the probability that ${y}_{best}$ is ranked at a given rank following RL and the probability it is ranked there initially. Results indicate that indeed more target tokens are ranked first, and less second, but little consistent shift of probability mass occurs otherwise across the ten first ranks. It is possible that RL has managed to push ${y}_{best}$ in some cases between very low ranks (<1000) to mediumlow ranks (between 10 and 1000). However, token probabilities in these ranks are so low that it is unlikely to affect the system outputs in any way. This fits well with the results of our simulations that predicted that only the initially topranked tokens are likely to change.
In an attempt to explain the improved BLEU score following RL with PkE, we repeat the NMT experiment this time using a constant reward of 1. Our results present a nearly identical improvement in BLEU, achieving 30.72, and a similar pattern in the change of the target tokens’ ranks (see Appendix 8). Therefore, there is room to suspect that even in cases where RL yields an improvement in BLEU, it may partially result from rewardindependent factors, such as PkE.^{4}^{4} 4 We tried several other reward functions as well, all of which got BLEU scores of 30.73–30.84. This improvement is very stable across metrics, trials and pretrained models.
6 Experiments with Contrastive MRT
In §2.3 we showed that CMRT does not, in fact, maximize $R$, and so does not enjoy the same theoretical guarantees as Reinforce and similar policy gradient methods. However, it is still the RL procedure of choice in much recent work on NMT. We therefore repeat the simulations described in §4 and §5, assessing the performance of MRT in these conditions. We experiment with $\alpha =0.005$ and $k=20$, common settings in the literature, and average over 100 trials.
Figure 6 shows how the distribution ${P}_{\theta}$ changes over the course of 50K update steps to $\theta $, where ${y}_{best}$ is taken to be the second and third initially most probable token (Simulated Reward setting). Results are similar in trends to those obtained with Reinforce: MRT succeeds in pushing ${y}_{best}$ to be the highest ranked token if it was initially second, but struggles where it was initially ranked third or below. We only observe a small PkE in MRT. This is probably due to the contrastive effect, which means that tokens that were not sampled do not lose probability mass.
All graphs we present here allow sampling the same token more than once in each batch (i.e., $S$ is a sample with replacements). Simulations with deduplication show similar results.
7 Discussion
In this paper, we showed that the type of distributions used in NMT entail that promoting the target token to be the mode is likely to take a prohibitively long times for existing RL practices, except under the best conditions (where the pretrained model is “nearly” correct). This leads us to conclude that observed improvements from using RL for NMT are likely due either to finetuning the most probable tokens in the pretrained model (an effect which may be more easily achieved using reranking methods, and uses but little of the power of RL methods), or to effects unrelated to the signal carried by the reward, such as PkE. Another contribution of this paper is in showing that CMRT does not optimize the expected reward and is thus theoretically unmotivated.
A number of reasons lead us to believe that in our NMT experiments, improvements are not due to the reward function, but to artefacts such as PkE. First, reducing a constant baseline from $r$, so as to make the expected reward zero, disallows learning. This is surprising, as Reinforce, generally and in our simulations, converges faster where the reward is centered around zero, and so the fact that this procedure here disallows learning hints that other factors are in play. As PkE can be observed even where the reward is constant (if the expected reward is positive; see §4.1), this suggests PkE may play a role here. Second, we observe more peakiness in the reinforced model and in such cases, we expect improvements in BLEU (Caccia et al., 2018). Third, we achieve similar results with a constant reward in our NMT experiments (§5.2). Fourth, our controlled simulations show that asymptotic convergence is not reached in any but the easiest conditions (§5.1).
Our analysis further suggests that gradient clipping, sometimes used in NMT (Zhang et al., 2016), is expected to hinder convergence further. It should be avoided when using Reinforce as it violates Reinforce’s assumptions.
The pertoken sampling as done in our experiments is more exploratory than beam search (Wu et al., 2018), reducing PkE. Furthermore, the latter does not sample from the behavior policy, but does not properly account for being offpolicy in the parameter updates.
Adding the reference to the sample $S$, which some implementations allow (Sennrich et al., 2017) may help reduce the problems of never sampling the target tokens. However, as Edunov et al. (2018) point out, this practice may lower results, as it may destabilize training by leading the model to improve over outputs it cannot generalize over, as they are very different from anything the model assigns a high probability to, at the cost of other outputs.
8 Conclusion
The standard MT scenario poses several uncommon challenges for RL. First, the action space in MT problems is a highdimensional discrete space (generally in the size of the vocabulary of the target language or the product thereof for sentences). This contrasts with the more common scenario studied by contemporary RL methods, which focuses mostly on much smaller discrete action spaces (e.g., video games (Mnih et al., 2015, 2016)), or continuous action spaces of relatively low dimensions (e.g., simulation of robotic control tasks (Lillicrap et al., 2015)). Second, reward for MT is naturally very sparse – almost all possible sentences are “wrong” (hence, not rewarding) in a given context. Finally, it is common in MT to use RL for tuning a pretrained model. Using a pretrained model ameliorates the last problem. But then, these pretrained models are in general quite peaky, and because training is done onpolicy – that is, actions are being sampled from the same model being optimized – exploration is inherently limited.
Here we argued that, taken together, these challenges result in significant weaknesses for current RL practices for NMT, that may ultimately prevent them from being truly useful. At least some of these challenges have been widely studied in the RL literature, with numerous techniques developed to address them, but were not yet adopted in NLP. We turn to discuss some of them.
Offpolicy methods, in which observations are sampled from a different policy than the one being currently optimized, are prominent in RL (Watkins and Dayan, 1992; Sutton and Barto, 1998), and were also studied in the context of policy gradient methods (Degris et al., 2012; Silver et al., 2014). In principle, such methods allow learning from a more “exploratory” policy. Moreover, a key motivation for using $\alpha $ in CMRT is smoothing; offpolicy sampling allows smoothing while keeping convergence guarantees.
In its basic form, exploration in Reinforce relies on stochasticity in the actionselection (in MT, this is due to sampling). More sophisticated exploration methods have been extensively studied, for example using measures for the exploratory usefulness of states or actions (Fox et al., 2018), or relying on parameterspace noise rather than actionspace noise (Plappert et al., 2017).
For MT, an additional challenge is that even effective exploration (sampling diverse sets of observations), may not be enough, since the stateaction space is too large to be effectively covered, with almost all sentences being not rewarding. Recently, diversitybased and multigoal methods for RL were proposed to tackle similar challenges (Andrychowicz et al., 2017; Ghosh et al., 2018; Eysenbach et al., 2019). We believe the adoption of such methods is a promising path forward for the application of RL in NLP.
References
 Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058.
 Ayana et al. (2016) Shiqi Shen Ayana, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with minimum risk training. arXiv preprint arXiv:1604.01904.
 Bojar et al. (2015) Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In [email protected].
 Caccia et al. (2018) Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language gans falling short. arXiv preprint arXiv:1811.02549.
 Degris et al. (2012) Thomas Degris, Martha White, and Richard S Sutton. 2012. Offpolicy actorcritic. arXiv preprint arXiv:1205.4839.
 Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364. Association for Computational Linguistics.
 Eysenbach et al. (2019) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2019. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations.
 Fox et al. (2018) Lior Fox, Leshem Choshen, and Yonatan Loewenstein. 2018. Dora the explorer: Directed outreaching reinforcement actionselection. ICLR, abs/1804.04012.
 Ghosh et al. (2018) Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. 2018. Divideandconquer reinforcement learning. In International Conference on Learning Representations.
 Hendricks et al. (2016) Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In ECCV.
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
 Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris CallisonBurch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180.
 Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 Liu et al. (2016) Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2016. Optimization of image description metrics using policy gradient methods. CoRR, abs/1612.00370, 2.
 Makarov and Clematide (2018) Peter Makarov and Simon Clematide. 2018. Neural transitionbased string transduction for limitedresource setting in morphology. In COLING.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533.
 Neubig (2016) Graham Neubig. 2016. Lexicons and minimum risk training for neural machine translation: Naistcmu at wat2016. In [email protected].
 Neubig et al. (2018) Graham Neubig, Matthias Sperber, Xinyi Wang, Matthieu Felix, Austin Matthews, Sarguna Padmanabhan, Ye Qi, Devendra Singh Sachan, Philip Arthur, Pierre Godard, John Hewitt, Rachid Riad, and Liming Wang. 2018. Xnmt: The extensible neural machine translation toolkit. In AMTA.
 Och (2003) Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational LinguisticsVolume 1, pages 160–167. Association for Computational Linguistics.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
 Plappert et al. (2017) Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. 2017. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.
 Press et al. (2017) O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf. 2017. Language generation with recurrent generative adversarial networks without pretraining. In Fist Workshop on Learning to Generate Natural [email protected].
 Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
 Rennie et al. (2017) Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Selfcritical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.
 Sakaguchi et al. (2017) Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2017. Grammatical error correction with neural reinforcement learning. arXiv preprint arXiv:1707.00299.
 Schulz et al. (2018) Philip Schulz, Wilker Aziz, and Trevor Cohn. 2018. A stochastic decoder for neural machine translation. In ACL.
 Sennrich et al. (2017) Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin JunczysDowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In EACL.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1715–1725.
 Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1692. Association for Computational Linguistics.
 Shen et al. (2017) Shiqi Shen, Yang Liu, and Maosong Sun. 2017. Optimizing nondecomposable evaluation metrics for neural machine translation. Journal of Computer Science and Technology, 32:796–804.
 Shetty et al. (2017) Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. 2017. Speaking the same language: Matching machine to human captions by adversarial training. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4155–4164. IEEE.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In ICML.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. MIT press.
 Tevet et al. (2018) G. Tevet, G. Habib, V. Shwartz, and J. Berant. 2018. Evaluating text GANs as language models. arXiv preprint arXiv:1810.12686.
 Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Qlearning. Machine learning, 8(34):279–292.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning, 8(34):229–256.
 Wu et al. (2018) Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and TieYan Liu. 2018. A study of reinforcement learning for neural machine translation. In EMNLP.
 Wu et al. (2017) Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and TieYan Liu. 2017. Adversarial neural machine translation. arXiv preprint arXiv:1704.06933.
 Yang et al. (2018) Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1346–1355. Association for Computational Linguistics.
 Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.
 Zhang et al. (2017) Jiac heng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, and Yang Liu. 2017. Thumt: An open source toolkit for neural machine translation. arXiv preprint arXiv:1706.06415.
 Zhang et al. (2016) Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016. Generating text via adversarial training. In NIPS workshop on Adversarial Training, volume 21.
Appendix A Contrastive MRT does not Maximize the Expected Reward
We hereby detail a simple example where following the Contrastive MRT method (see §2.3) does not converge to the parameter value that maximizes $R$.
Let $\theta $ be a real number in $[0,0.5]$, and let ${P}_{\theta}$ be a family of distributions over three values $a,b,c$ such that:
$${P}_{{}_{\theta}}(x)=\{\begin{array}{cc}\theta \hfill & x=a\hfill \\ 2{\theta}^{2}\hfill & x=b\hfill \\ 1\theta 2{\theta}^{2}\hfill & x=c\hfill \end{array}$$ 
Let $r(a)=1,r(b)=0,r(c)=0.5$. The expected reward as a function of $\theta $ is:
$$R(\theta )=\theta +0.5(1\theta 2{\theta}^{2})$$ 
$R(\theta )$ is uniquely maximized by ${\theta}^{*}$ = 0.25.
Table 1 details the possible samples of size $k=2$, their probabilities, the corresponding $\stackrel{~}{R}$ and its gradient. Standard numerical methods show that $\mathbb{E}[\nabla \stackrel{~}{R}]$ over possible samples $S$ is positive for $\theta \in (0,\gamma )$ and negative for $\theta \in (\gamma ,0.5]$, where $\gamma \approx 0.33$. This means that for any initialization of $\theta \in (0,0.5]$, Contrastive MRT will converge to $\gamma $ if the learning rate is sufficiently small. For $\theta =0$, $\stackrel{~}{R}\equiv 0.5$, and there will be no gradient updates, so the method will converge to $\theta =0$. Neither of these values maximizes $R(\theta )$.
We further note that resorting to maximizing $\mathbb{E}[\stackrel{~}{R}]$ instead, does not maximize $R(\theta )$ either. Indeed, plotting $\mathbb{E}[\stackrel{~}{R}]$ as a function of $\theta $ for this example, yields a maximum at $\theta \approx 0.32$.
$S$  $P(S)$  $\stackrel{~}{R}$  $\nabla \stackrel{~}{R}$ 

$\{a,b\}$  $4{\theta}^{3}$  $\frac{1}{1+2\theta}$  $\frac{2}{{(1+2\theta )}^{2}}$ 
$\{a,c\}$  $2\theta (1$$\theta $${\theta}^{2})$  0.5 + $\frac{\theta}{24{\theta}^{2}}$  $\frac{1+2{\theta}^{2}}{2(12{\theta}^{2})}$ 
$\{b,c\}$  $4{\theta}^{2}(1$$\theta $$2{\theta}^{2})$  $\frac{1\theta 2{\theta}^{2}}{22\theta}$  $\frac{{\theta}^{2}3\theta}{2{(1\theta )}^{2}}$ 
$a,a$  ${\theta}^{2}$  1  0 
$b,b$  $4{\theta}^{4}$  0  0 
$c,c$  $(1$$\theta $$2{\theta}^{2}){}^{2}$  0.5  0 
Appendix B Deriving the Gradient of $\stackrel{~}{R}$
Given $S$, recall the definition of $\stackrel{~}{R}$:
$$\stackrel{~}{R}(\theta ,S)=\sum _{i=1}^{k}{Q}_{\theta ,S}({y}_{i})r({y}_{i})$$ 
Taking the deriviative w.r.t. $\theta $:
$$\sum _{i=1}^{k}r({y}_{i})\frac{\nabla P(y)\cdot \alpha P{(y)}^{\alpha 1}\cdot Z(S)\nabla Z(S)\cdot P{(y)}^{\alpha}}{Z{(S)}^{2}}=$$ 
$$\sum _{i=1}^{k}r({y}_{i})\left(\frac{\alpha \nabla P({y}_{i})}{P({y}_{i})}Q({y}_{i})\frac{\nabla Z(S)}{Z(S)}Q({y}_{i})\right)=$$ 
$$\sum _{i=1}^{k}r({y}_{i})Q({y}_{i})\left(\alpha \nabla \mathrm{log}P({y}_{i})\nabla \mathrm{log}Z(S)\right)=$$ 
$$\alpha \sum _{i=1}^{k}\left(r({y}_{i})Q({y}_{i})\nabla \mathrm{log}P({y}_{i})\right){\mathbb{E}}_{Q}[r]\nabla \mathrm{log}Z(S)$$ 
Appendix C NMT Implementation Details
True casing and tokenization were used (Koehn et al., 2007), including escaping html symbols and "" that represents a compound was changed into a separate token of =. Some preprocessing used before us converted the latter to ##AT####AT## but standard tokenizers in use process that into 11 different tokens, which overrepresents the significance of that character when BLEU is calculated. BPE (Sennrich et al., 2016) extracted 30715 tokens. For the MT experiments we used 6 layers in the encoder and the decoder. The size of the embeddings was 512. Gradient clipping was used with size of 5 for pretraining (see Discussion on why not to use it in training). We did not use attention dropout, but 0.1 residual dropout rate was used. In pretraining and training sentences of more than 50 tokens were discarded. Pretraining and training were considered finished when BLEU did not increase in the development set for 10 consecutive evaluations, and evaluation was done every 1000 and 5000 for batches of size 100 and 256 for pretraining and training respectively. Learning rate used for rmsprop (Tieleman and Hinton, 2012) was 0.01 in pretraining and for adam (Kingma and Ba, 2015) with decay was 0.005 for training. 4000 learning rate warm up steps were used. Pretraining took about 7 days with 4 GPUs, afterwards, training took roughly the same time. Monte Carlo used 20 sentence rolls per word.