Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

  • 2019-05-31 05:20:12
  • Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, Jian Peng
  • 3


Recent advances in deep reinforcement learning algorithms have shown greatpotential and success for solving many challenging real-world problems,including Go game and robotic applications. Usually, these algorithms need acarefully designed reward function to guide training in each time step.However, in real world, it is non-trivial to design such a reward function, andthe only signal available is usually obtained at the end of a trajectory, alsoknown as the episodic reward or return. In this work, we introduce a newalgorithm for temporal credit assignment, which learns to decompose theepisodic return back to each time-step in the trajectory using deep neuralnetworks. With this learned reward signal, the learning efficiency can besubstantially improved for episodic reinforcement learning. In particular, wefind that expressive language models such as the Transformer can be adopted forlearning the importance and the dependency of states in the trajectory,therefore providing high-quality and interpretable learned reward signals. Wehave performed extensive experiments on a set of MuJoCo continuous locomotivecontrol tasks with only episodic returns and demonstrated the effectiveness ofour algorithm.


Quick Read (beta)

Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

Yang Liu
&Yunan Luo
&Yuanyi Zhong
&Xi Chen
&Qiang Liu
UT Austin
&Jian Peng

Recent advances in deep reinforcement learning algorithms have shown great potential and success for solving many challenging real-world problems, including Go game and robotic applications. Usually, these algorithms need a carefully designed reward function to guide training in each time step. However, in real world, it is non-trivial to design such a reward function, and the only signal available is usually obtained at the end of a trajectory, also known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment, which learns to decompose the episodic return back to each time-step in the trajectory using deep neural networks. With this learned reward signal, the learning efficiency can be substantially improved for episodic reinforcement learning. In particular, we find that expressive language models such as the Transformer can be adopted for learning the importance and the dependency of states in the trajectory, therefore providing high-quality and interpretable learned reward signals. We have performed extensive experiments on a set of MuJoCo continuous locomotive control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.


Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

  Yang Liu UIUC Yunan Luo UIUC Yuanyi Zhong UIUC Xi Chen Qiang Liu UT Austin Jian Peng UIUC


noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Deep reinforcement learning (RL) methods, including the well-known policy gradient algorithms [24, 31, 32] and deep Q-networks [25], have shown superior performance and great potential in many difficult real-world problems, such as the Go game [33, 34], locomotive continuous control problems [21], resource management [23], and robotics [20]. The key idea of such algorithms is to use deep neural networks as functional approximators to abstract or represent complex state observation so that actions can be properly chosen accordingly to optimize a long-term expected return. The learned policy or Q function essentially captures the temporal structure of the sequential decision problem and decompose it to a supervised learning problem, guided by the reward signal. However, in many real-world problems, the reward signal is usually not dense enough to provide sufficient supervision for learning the decision at each single time step. In many practical tasks, such as the Go game and the automatic chemical design problems [27], we can only obtain a final reward or return value after finishing the entire rollout of the policy, while no intermediate reward is provided before reaching the end of the trajectory. This type of problems is also known as the episodic reinforcement learning.

Unfortunately, when the reward signal becomes delayed or even episodic, most existing deep reinforcement learning algorithms may get stuck during the training process and often suffer from inferior performance and inefficient sample complexity [8, 9]. This problem is widely known as the temporal credit assignment in reinforcement learning [39], which describes the issue of delayed rewards causing the signal to be diluted over time and only weakly affecting the states temporally distant from the time step when the rewards get collected. For example, in Go games, the only effective reward is the final win or loss. This reward is received only after finishing the entire games, usually consisting of hundreds of moves in the trajectory. During the game, though human professional players or experts may be able to decide which moves are likely to influence the final winning probability, quantitatively design of such informative rewards is very challenging. Such sparse or episodic reward signal makes the training of policies or Q-neural networks very difficult, as a lot of data is required to propagate the final win/loss reward back to earlier states or state-action pairs. In addition, since there is no immediate reward or even no short-term reward, the exploration becomes challenging without any information. Similar to Go games, there are many real-world problems in which rewards are terribly delayed or episodic, with only a non-zero reward value received at the end of the episode or the trajectory.

In this paper, we introduce a new algorithm for the temporal credit assignment problem. The idea is to learn deep neural networks that is able to decompose the episodic reward into parts and assign them back to each time step in the trajectory. After learning this decomposition, we can use the assigned dense reward signal to guide policy optimization. Formally, we derive a generalized policy gradient for decomposed reward signal. In our derivation, we find that to ensure the correctness of the policy gradient algorithms, the dependency of the neural network on time steps needs to be forward-looking, i.e. for a time t before the terminal step T, the reward function rt({st,at}t=0T)=rt({st,at}t=0t) does not depend on any future states or actions. This forward structure, frequently seen in natural language processing [18], motivates us to apply the renowned neural-network sequence models, such as language models to learning the reward decomposition. In particular, we adopt the Transformer, which satisfies the forward-looking structure, for learning the importance and the dependency of states using the self-attention mechanism. With the learned reward signal, we show on a set of MuJoCo continuous locomotive control tasks with episodic returns that the learning and sample efficiency of can be greatly improved.

2 Method

2.1 Background and Problem Statement

Reinforcement learning

Reinforcement learning considers the problem of finding an optimal policy for an agent which interacts with an environment and collects reward per action. The goal of the agent is to maximize the cumulative reward along a trajectory. Formally, this problem can be formulated as a Markov decision process (MDP) over the environment states sS and agent actions aA, under an unknown environmental dynamics defined by a transition probability T(s|s,a). The agent’s action a is selected by a conditional probability distribution πθ(a|s) parameterized by θΘ. Playing the policy repeatedly under MDP yields a trajectory τ={st,at}t=0T, where T denotes the horizon length. Each trajectory τ is associated with a reward function R(τ) which we want to optimize, that is, maxθJ(θ):=𝔼πθ[R(τ)], where the expectation 𝔼πθ is with the distribution of the unknown dynamics T(st+1|st,at) and policy πθ(at|st).

In standard RL settings, it is common to assume that the reward is a discounted sum of a set of local reward functions distributed across the time, that is, R(τ)=t=0γtrt(st,at), where the local reward signal r(s,a) is assumed to be observed immediately following the action a performed at state s, and γ[0,1) is the discount factor. This decomposition structure greatly simplifies the problem, and forms the basic assumption of popular policy gradient algorithms. The focus of this work, however, is the more difficult case when R(τ) does not have a simple or known decomposition structure priori, and is observed only after the whole trajectory τ is rolled out.

Policy Gradient There are several different types of algorithms for learning policies, including Q-learning, policy gradient and evolutionary algorithms. For the general episodic reward function R(τ), we have a basic gradient estimation derived using the likelihood ratio trick,

θJ(θ)=θ𝔼πθ[R(τ)]=𝔼πθ[R(τ)t=0Tθlogπ(at|st)]. (1)

This basic algorithm, however, often yields large variance in gradient estimation. The policy gradient theorem [38] allows us to derive a simplified formula for the case when the reward is decomposed step-wisely:

θJ(θ)=𝔼π[θlogπ(a|s)Qπ(s,a)], (2)

where Qπ(s,a)=𝔼π[t=0γtr(st,at)|s0=s,a0=a] denotes the expected return under policy π starting from state s and action a. With empirical estimate of Q^π(st,at)=jtγj-trj using rollout trajectories, then we can obtain the well-known REINFORCE policy gradient [41] as

^θJ(θ)=1Tt=0Tγtθlogπ(at|st)Q^π(st,at). (3)

Improved policy gradient methods, such as the Proximal Policy Optimization (PPO) [32, 12], are now able to provide the state-the-art performance on many problems. It uses a proximal Kullback-Leibler (KL) divergence penalty to regularize and stabilize the policy gradient. Furthermore, control variate methods, such as the generalized advantage estimation (GAE) [31], help reducing the variance of the policy gradient estimation. So far, most policy gradient methods are designated for infinite-horizon, dense reward settings, as the dense rewards can provide direct supervision for value function estimation and policy improvement in each time step.

Episodic RL For tasks with episodic rewards, usually, there is a set of terminal states. At the end of each trajectory τ, a reward is only received at the terminal state. In other words, before reaching the final state sT, rewards rt(st,at)=0 for all t<T. In many tasks, the terminal states can be predefined, or the length of the trajectories is limited. For simplicity, we omit the discount factor and assume the trajectory length is at most T. In this way, we can abuse the notation of sT to denote the last state without further confusion. Therefore, the objective of episodic reinforcement learning becomes J(θ)=𝔼πθ[R(τ)]=𝔼πθ[rT(sT)]. Note that it is not hard to add the discount factor back and/or adopt the mathematical definitions and derivations for problems with a set of terminal states.

For episodic problems, the straightforward application of policy gradient methods, including REINFORCE [41], A2C [24] and PPO [32], may suffer from sample inefficiency, as the final episodic reward would only provide the same or similar supervision for learning policy over all time steps in a trajectory. Therefore, a huge volume of rollout trajectories are required to distinguish the subtle influence of certain action on the final reward. Besides policy optimization methods, blackbox optimization approaches , such as cross entropy method [29], CMA-ES [11] and evolution strategies [30], have been also applied to episodic RL problems due to their computational efficiency.

2.2 Overview of Our Approach

We propose a new approach to learn a dense surrogate reward function that approximates the temporal credit assignment of the episodic reward (Algorithm 2.2). The idea is intuitive: we hope to find r^(st,at) approximating R(τ)=t=0Tr^(st,at), so that we can use r^ as the surrogate reward to compute the policy gradient. If r^ is dense over time step and provide sufficient information about the influence on the episodic reward, then it should help improve the sample efficiency of training policies under the episodic reward setting.


[H] Policy optimization with decomposed reward \[email protected]@algorithmic[1] \STATEInitialize: policy parameters θ0, predictor parameters ϕ0 \FORi = 1, 2, 3, … N \STATECollect a batch of trajectories using roll-outs. \STATEAppend the new trajectories into trajectory buffer for regression. \STATETrain reward predictor using gradient descent: ϕiϕi-1-γϕϕregression \STATEUpdate policy parameters using policy optimization algorithm: θiθi-1+γθJ(θ) where J(θ) is obtained with Eq. (7) . \ENDFOR\STATEOutput: policy πθN, reward predictor ϕN

Here, we consider a generalization of the time step-wise reward function. Assume a reward function r^ that is defined on states and actions over a time interval α. Then we expect that the episodic reward R(τ)=rT(sT) can be decomposed as the sum of the reward function on all intervals: αr^(sα,aα), where sα={si|iα} and aα={ai|iα}. The choice of can be very flexible: If each interval only contains a single time step, then the reward function is defined on each time step α{{0},{1},,{T}}; If contains all consecutive sub-sequences starting from time 0, then each α{{0,1,2,,t}:t=0,,T} contains all time steps from the beginning of the trajectory to the current time step t.

Therefore, the objective function of learning such reward r^ can be done by minimizing the regression loss as following:

minϕregression(ϕ):=τD(αr^ϕ(sα,aα)-R(τ))2 (4)

where r^ϕ can be a neural network that takes sα and aα as input and is parameterized by ϕ, D is a collection of trajectories. There are several critical choices: first, how one should define the interval set ; second, which model would be used for the reward function; third, how to collect the dataset D. We will discuss the design principles in the following sections and their impact in the experimental section.

2.3 Generalized Policy Gradient with Rewards on Time Intervals

Policy Gradient of Composite Reward

Now, let us assume we have learned a composite approximation of the reward function, R^(τ)=αr^(sα,aα), where r^(sα,aα) is a local reward function that defined on state and actions over time interval α. Our key idea is to leverage the decomposition structure of R^ to simplify and reduce the variance of the policy gradient formula, as we summarize in the following generalization of policy gradient theorem to the composite rewards.

Theorem 1

I) Denote by J^(θ):=Eπθ[R^(τ)] the expectation of the composite reward R^(τ), we have

J^(θ) =θ𝔼πθ[R^(τ)]=α𝔼πθ[r^(sα,aα)tΓαθlogπ(at|st)], (5)

where Γα={t:tmax(α)} and max(α) denotes the maximum element of set α; note that Γα is the set of all t that θlogπ(at|st) should multiply by r^(sα,tα).

II) Equivalently, we have

J^(θ)=𝔼πθ[t=0TQt(τ)θlogπ(at|st)], (6)

where Qt is a generalized Q-function, defined to be


Here Γt* is the set of all α whose upper bound max(α) excess t.

Theorem 1 allows us to leverage decomposition structure of R^(τ) to increase the efficiency of policy gradient. Compared to the basic gradient formula in (1), Eq. (5) keeps only the logπ(at|st) in set Γα for each local reward r^(sα,aα). This is obtained by the fact of MDP that present actions only impact the future but not the past, similar to what we have seen in the original policy gradient.

Eq. (6) in Theorem 1 can be viewed as a generalization of policy gradient theorem in Eq. 2, and Qt is similar to the definition of typical notion of Q-function, which corresponds to our special case when each α includes an individual time steps, that is, ={{0},{1},,{T}}.

By replacing the expectation in Eq. (6) with empirical average from rollout trajectories, we can derive a generalized policy gradient for the composite reward R^(τ), that is, given a set of trajectories τi={(sti,ati)t=0T}, i=1,,n, we have


where again Qt(τi)=αΓt*r^(sαi,aαi) is the generalized Q function.

Bias Correction and Control Variates

The method above works well if R^(τ) forms an accurate approximation of R(τ). However, when the approximation is poor, it introduces a significant bias into the gradient estimation and hence deteriorates the performance. To address this problem, we propose to add the residual term to correct this bias. This yields a gradient estimation of form

J(θ)=𝔼πθ[r0(τ)t=0Tlogπ(at|st)]+J^(θ). (7)

where r0(τ)=R(τ)-R^(τ) denotes the residual error, and J^(θ) is the gradient of composite reward J^(θ)=𝔼πθ[R^(τ)] in (6) and (7). This allows us to give an unbiased gradient estimation of the true expected reward J(θ), while being able to leverage the structure of the composite reward.

Theoretically, we can show that the residual corrected gradient estimation (7) can be viewed as a control variate, a widely used approach for reducing variance in policy optimization. In particular, we can show that (7) is equivalent to

J(θ)=𝔼πθ[t=0T(R(τ)-r^¬t(τ))θlogπ(at|st)] (8)

where r^¬t(τ)=αΓt*r^(sα,aα), which collects all the terms that do not appear in (5). We can show that 𝔼πθ[r^¬t(τ)θlogπ(at|st)]=0 for all t, and hence subtracting r¬t in (8) does not change the expectation of the formula. Note that r¬t is a generalization of the standard baseline, which is typically taken to be a constant, or depend only on st.

With this unbiased estimator, the policy gradient would be robust so that we no longer need to worry about whether the regression of r^ is sufficient or the dataset D is not well chosen. In practice, we find that both θJ^(θ) and θJ(θ) works reasonably well if the regression loss is sufficiently optimized.


Note that if is the set of individual time steps and R(τ)=tr^(st,at), Eq. (8) will be reduced to the classic policy gradient as shown in Eq. (2). It is probably more interesting to consider to go beyond the set of time steps to include more information. One critical observation from the above derivation is that the interval α can be as large as possible if max(α) is upper bounded by time t, if we want the supervision of r^(sα,aα) for θlogπθ(st,at). Therefore, to include as much information as possible, it is natural to see that we can define as the set of all consecutive sub-sequences starting from time 0, so that each α{{0,1,2,,t}:t=0,,T} contains all time steps from the beginning of the trajectory to the current time step t. This structure of is particularly interesting and highly resembles the forward-looking structure studied in sequence modeling in NLP, such as language modeling.

2.4 Learning Temporal Credit Assignment using Language Models

Motivated by the forward-looking structure, we propose to adopt neural-network language models for learning the reward function r^ for credit assignment. In nature language processing, a language model assigns probability to a given sequence in a language [5]. A more tangible and related model is to assign probability of an upcoming word given a sequence of prior words. More formally, a language model predicts the probability of word wt by parameterizing the conditional distribution p(wt|w1,w2,,wt-1). Such models would be very useful in many applications, especially those generating sequences as output. Notable examples including the n-gram models and the recurrent neural networks, which attempt to capture medium- to long-range dependencies in the sentence. Very recently, a model entirely based on attention mechanisms was proposed for language modeling and has achieved state-of-the-art performance on neural machine translation and other NLP tasks  [40, 7]. The model, called Transformer, has an encoder-decoder structure and is composed of stacked self-attention and fully connected layers, without using any recurrence or convolution. To attend multiple parts of the input sequence simultaneously, instead of using a single large attention “head", the Transformer uses multiple small attention heads to project the input sequence into multiple subspaces and combines the attention outputs by concatenation. Note that in Transformer, the self-attention for constructing an latent vector at a word is based on all other words prior to the current one, which highly resembles the desideratum for the reward function, as discussed above. Furthermore, it is also possible to use recurrent neural networks or convolutional neural networks to model the reward function.

In this work, we consider the Transformer network model, because of its superior performance and interpretability. Our model for learning the reward function r^ consists of two parts. The first part is an encoder module of the Transformer, which is composed of a multi-head attention layer followed by a position-wise fully-connected feed-forward layer. The second part is a self-attention layer [22], which generates a set of summation weight vectors for the encoder outputs. The set of summation weight vectors are used to multiply with the Transformer outputs, resulting in a hidden representation which is then processed by a regression layer to give the predicted rewards r^.

Specifically, suppose we have a trajectory that has n state-action pairs, represented as τ={(st,at)t=0T-1}. Here st is a ds-dimensional observation vector and at is a da-dimensional action vector. τ is thus represented as a T by (ds+da) matrix. Note that we omit the special terminal state sT here for notational simplicity, if not causing further confusion. Each state-action pair (st,at) in the trajectory is then processed by a feed forward layer, whose parameters are shared across all time steps, to give a fixed-length vector representation 𝐯t for this state-action pair. To gain the dependency between the current time step t and other time steps before t, we use a encoder layer to process the state-action pairs: 𝐡t=Transformer(𝐯0,𝐯1,,𝐯t). Let the dimension of each start-action pair representation vector 𝐯t be d. Since the Transformer network does not change the dimension of input vectors, we represent all the T representation vectors 𝐡t as a T by d matrix H=(𝐡0,𝐡1,,𝐡T-1). To summarize the information in H, we apply a self-attention mechanism:


Here Ws1 is a weight matrix with dimension dz by d and 𝐰s2 is a vector of parameters with size dz, where dz is a hyper-parameter. Vector 𝐳 has size T, and each entry zt ranges from 0 to 1, quantifying the importance of the state-action pair (st,at) in predicting the reward r(sαt,aαt) for interval αt={0,1,,t}. We combine the hidden representations in H using 𝐳 to obtain a summarized representation 𝐡t*=zt𝐡t. To predict the reward r^(sαt,aαt), we add one regression layer parameterized by 𝐰r and br and output the predicted reward as


Some other networks we consider include the feed-forward neural network and long short-term memory network (LSTM). Please see Figure A2 to see their differences.

3 Experiments

In this section, we conduct experiments to provide evidences for the following questions? (1) Is the learned reward function useful for improving policy optimization in episodic RL? (2) What are the appropriate choices of interval set , neural network model, and dataset D? Note that we also provide a case study on visualization and intepretability of the learned reward function in Appendix.

3.1 Experimental Settings

The experiments are conducted on a set of high-dimensional locomotion tasks in continuous domain using OpenAI gym [6] and MuJoCo simulation toolkits. We use the PPO algorithm introduced in [32] for all the experiments. The policy is represented by a uni-mode Gaussian distribution with diagonal covariance. The mean is parameterized by a two-layer neural network with 64 hidden units and tanh non-linearity. The log standard deviation is parameterized by a global vector. The same architecture is applied for value function approximation. In practice, we implemented the policy gradient estimation shown in Eq. (7), where we decompose into two parts: one with the predicted reward, the other with only the residual. In this way, we can use advanced variance reduction approaches to improve the estimation of each components. In our experiments, we use the generalized advantage estimation (GAE) [31] as the control variate method for variance reduction. More hyper-parameters of PPO and GAE are tabulated in Table A1 in Appendix. We also compare against baseline algorithms trained with episodic reward, respectively. The episodic reward of each rollout trajectory is defined as the accumulated original reward at all time-steps. The experiments are run for 5M timesteps with 5 random seeds. For baseline algorithms, we also consider an LSTM policy with 128 hidden units besides the aforementioned MLP policy. All experiments are conducted on NVIDIA 1080 GPUs.

Hopper Walker2d Humanoid Humanoid-Standup Swimmer
PPO (episodic) 437 266 516 44673 6
CEM 97 205 426 9.6×104 17
Ours 1462 3217 2209 82579 135
Table 1: Performance of PPO and our approach on locomotion tasks with episodic rewards. Scores taken at 5M iterations with the environment. Cross entropy method performance taken from [8].

For the reward function in our experiments, we investigate three neural network architectures, Feed-Forward (FF) Network, LSTM [13], and Transformer, to identify the best network structure that captures temporal dependency (Figure A2). For observation and action st,at at time t, an FF network, whose parameter ϕ is shared across all time steps, is used to give the predicted reward r^t for individual time steps. LSTM’s reward prediction is also conditioned on all previous timesteps, so is a function of trajectory segment s0:t,a0:t. To perform reward regression based on the LSTM/FF outputs, we aggregate the the representations using a mean-pooling layer. The hyper-parameters of LSTM and FF are also listed in A1.

For the buffer updating schemes, we proposed the following three approaches to renew the trajectory buffer for reward predicting.

  • Online (O): The buffer is implemented as a FIFO queue with length K (hyper-parameter). In each iteration, the new roll-out trajectories will be inserted into the queue and the old trajectories will be removed if the queue is too long.

  • Historical+Online (HO): Two queues are maintained: One queue is the same as the one in Online, storing the most recent rollouts by the current policy. The other queue stores the trajectories with the highest episodic returns from previous rollouts.

  • Stratified-Sampling (S): To balance the training with episodic return regression, the buffer stores a larger number of trajectories in the history. In each iteration, K trajectories are sampled to ensure the episodic return is uniformly in the five bins. The queue size L and sample queue size K are hyper-parameters.

3.2 Credit Assignment Enables Policy Optimization with Episodic Reward

Our experiments with the baseline algorithms on the MuJoCo control suite demonstrated that learning with episodic return is extremely hard both with MLP policy and LSTM policy. In all the environments we tested (Figure 1), it is not surprising that the baseline method PPO (episodic) with episodic reward performs much worse than previously reported in other papers with dense reward [32]. In most of the runs, the policy cannot make any improvement after the initial time steps.

Figure 1: Learning curves for PPO baselines and our proposed method on tasks with episodic rewards. Mean and standard deviation over 5 random seeds is plotted. The x- and y- axis represent the number of training samples (in million) and average return, respectively.

On the other hand, our proposed credit-assignment algorithm with learned return decomposition consistently achieves better performance than the episodic return baselines across all environments (Figure 1). Here we use the Transformer network for decomposing rewards and MLP for representing policy, and the HO strategy to collect trajectories for optimization. In many experiments, the policies learned by our method are able to achieve the quite reasonably good performance when the original dense reward is used for training. In environments such as the Humanoid, the Hopper and the Swimmer, with the appropriate hyper-parameters, our methods can obtain comparable performance to the policies trained with the original dense rewards, outperforming the episodic return baseline by a large margin. In addition, it also outperforms the cross-entropy method (CEM) which is suitable for the episodic RL setting (Table 1). These results indicate that the our reward decomposition framework can successfully enable stable learning and greatly improve the sample efficiency in episodic settings.

3.3 Ablation Analysis

Here we perform comparisons to check the choices of 1) three network structures for reward function; 2) strategies for data collection. We also check the utility of bias correction when the reward regression is not well fitted. We use the environment Walker2d to perform these analyses and demonstrate the results in Figure 2.

As we can see, all the three networks structures outperform the episodic baselines, and the Transformer network provides the best performance. We conjecture that the reason why Transformer performs better than LSTM is that it is easier to train, as also implied by existing NLP literature [40]. Then we check whether the data collection strategy has any impact on the performance. All three strategies seem to work reasonably well, while HO appears to be the best which indicating that the historical high-quality trajectories helps reward learning. Finally, we compare the performance of θJ(θ) and θJ^(θ) under different learning rates (10-2 and 10-3) for the regression loss. We use different learning rates essentially to adjust the quality of the reward fitting. It is easy to see that with the bias correction, the learning of θJ(θ) is more robust, indicating that bias correction is needed. Extra results on ablation analysis can be found in the appendix (Figures A3, A4).

Figure 2: Ablation analysis on choices of (a) network structures for reward function; (b) strategies for data collection; (c)-(d) methods with or without bias correction under different learning rates (lr). Environment Walker2d was used to perform the analyses. Mean and standard deviation over 5 random seeds is plotted.

4 Related Work

The optimal reward design problem [35, 36] concerns about finding the proxy reward function can obtains high expected return according to the true reward function. Inverse reward design [10] studies the opposite problem of inferring true reward from designed reward. Intrinsic motivation has been shown to improve sample efficiency of RL algorithms, for example by using information gain [15], pseudo count [4] or prediction error [37, 28] as an intrinsic bonus reward to aid exploration. [42] studies intrinsic motivation under the optimal reward design framework where the optimal intrinsic reward is learned through gradient descent. Reward shaping [26] explores the space of reward function modifications (specifically potential-based rewards) which do not change the corresponding optimal policy. Hindsight Experience Replay [2] adds additional goals and corresponding rewards to a Q-learning algorithm. Meta-learning can also be utilized to learn different objective functions for different goals [14]. Our approach is more general as we don’t assume specific goals of the agent. When only expert demonstrations are available as in imitation learning, inverse reinforcement learning can be used to recover the reward function from expert trajectories [1]. As an alternative approach for solving sparse reward problems, the auxiliary-task approaches [17] focus on improving the representation by adding extra self-supervised losses. However, our proposed method learns to decompose the episodic return as reward for policy optimization directly.

Recently, several concurrent works have explored the same direction of decomposing the episodic return. Sparse Attentive Backtracking[19] applies the attentive mechanism for network straining, while ours focuses on learning the decomposition of the reward signal. Temporal Value Transport (TVT) [16] relies on a memory reconstitutive module to retrieve past data for the visual RL problems, however, ours aims at decomposing the episodic return for general problems. As one most related work, RUDDER [3] could be viewed as a special case of our work in which consecutive time-steps were used. On the other hand, our work provides a general framework with interval rewards which justifies the applicability of language models and the correctness of generalized policy gradient beyond the simple Markovian assumption.

5 Conclusion

We presented a new algorithm for learning temporal credit assignment, which uses deep neural network (Transformer) to decompose the episodic reward back to each time step in the trajectory. The assigned dense reward signals obtained from the decomposition are then used to guide training algorithms of policies. We demonstrated that our credit assignment algorithm substantially improved the learning and sample efficiency on a set of MuJoCo continuous locomotive control tasks. The reward function learned by our algorithm can also be interpreted by an attention mechanism, potentially providing insights on identifying key state-action pairs that contribute to successful reinforcement learning.


  • [1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
  • [2] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
  • [3] Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
  • [4] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
  • [5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • [6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [8] Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning self-imitating diverse policies. Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • [9] Yijie Guo, Junhyuk Oh, Satinder Singh, and Honglak Lee. Generative adversarial self-imitation learning. arXiv preprint arXiv:1812.00950, 2018.
  • [10] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765–6774, 2017.
  • [11] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
  • [12] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
  • [13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [14] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.
  • [15] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
  • [16] Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. arXiv preprint arXiv:1810.06721, 2018.
  • [17] Max Jaderberg, Volodymyr Mnih, Wojciech Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2017.
  • [18] Daniel Jurafsky and James H. Martin. Speech and Language Processing (2Nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009.
  • [19] Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C. Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, pages 7651–7662, USA, 2018. Curran Associates Inc.
  • [20] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • [21] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
  • [22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
  • [23] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pages 50–56. ACM, 2016.
  • [24] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • [25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [26] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
  • [27] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1):48, 2017.
  • [28] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
  • [29] Reuven Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1(2):127–190, 1999.
  • [30] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • [31] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [33] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • [34] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
  • [35] Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from. In Proceedings of the annual conference of the cognitive science society, pages 2601–2606, 2009.
  • [36] Jonathan Sorg, Richard L Lewis, and Satinder P Singh. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pages 2190–2198, 2010.
  • [37] Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
  • [38] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [39] Richard Stuart Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 1984. AAI8410337.
  • [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [41] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [42] Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4649–4659, 2018.

Appendix A Method Overview and Network structures

Figure A1: Overview of our approach. Rollout trajectories are generated from interacting with the environment. A reward predictor is trained on the collected trajectories and episodic returns with the regression loss. Then the predicted rewards are used for policy optimization.
Figure A2: Network structures for the reward predictor: (a) Feed-forward network, (b) LSTM network, (c) Transformer network

Appendix B Proof of Theorem 1

I) Recall the standard likelihood ratio gradient formula:

θJ^(θ) =θ𝔼πθ[R^(τ)]
=α𝔼πθ[r^(sα,aα)t=0Tθlogπ(at|st)]. (9)

On the other hand, note that for any tΓα, we have


Therefore, all the pairs (α,t) with tΓα is removed in (9). This hence yields (5).

II) Eq. (6) is a simple rearrangement of Eq. (5).

θJ^(θ) =α𝔼πθ[r^(sα,aα)tΓαθlogπ(at|st)]

where (α:tΓαr^(sα,aα)=Qt(τ), matching our definition. This completes the proof.

Appendix C Hyper-parameters

Hyper-parameter Searching Values
Policy Network MLP with shape (64, 64)
PPO Batch Size 2048
PPO Mini-Batch Size 64
PPO number of epoch per iteration 5
PPO learning rate 0.0001
PPO clip range ϵ 0.2
GAE γ 0.99
GAE λ 0.95
Buffer size 50
Reward predictor learning rate 0.001
Transformer number of heads 4
Transformer layer size 64
Transformer hidden layer size 128
Transformer query/key size 32
LSTM hidden size 96
FF channels [128,128,128,256]
Table A1: The hyper-parameters we used in the episodic MuJoCo environment.

Appendix D More Results on Ablation Analysis

Figure A3: Comparison between different buffer updating methods. The x-axis denotes the number of training samples and y-axis denotes the average episodic return. The red curve represents the training curve using episodic return. The yellow, blue and green curves represent the algorithm with online buffer scheme, historical-online scheme and stratified-sampling scheme, respectively.
Figure A4: Comparison between network structures: The x- and y- axis represent the number of training samples and average return, respectively. The red curve represents the training curve using episodic return. The blue, yellow and green lines represent our method with a Transformer, LSTM and FF network structure, respectively.

Appendix E Interpretability of the Learned Reward Function

We visualize the learned agent and reward function to demonstrate that the reward predictor could attain knowledge from interacting with environment. Key state action pairs that contribute to successful reinforcement learning can also identified from the visualization . Using the Hopper environment as an example, we visualized in Figure A5 the learned temporal attentions for 1000 time steps, extracted from the last layer of the Transformer network. The first stage A corresponds to the agent starting a large jump, followed by the landing (stage B) and laying on the ground (stage C). The agent then makes a smaller hop in stage D and lands in stage E. We observed that the temporal attentions exhibit periodic behavior that coincides with the agent’s periodic motion. In the Hopper environment, the goal is to make the 2D one-legged robot move forward as fast as possible. A successfully trained agent essentially learns to hop forward periodically. Hence it is reasonable that the temporal attention and predicted reward show periodicity as the hopper jumps. Similar periodic behaviors of moving and adjusting balance were also be observed in other locomotion environments such as Humanoid.

Figure A5: (Above) Learned temporal attention in the Transformer structure shows periodicity over time in the Hopper environment. A single period consists of five stages as marked by A to E. The model gives higher attention to jumping (A, D) than landing (B, E). (Below) visualizes the learned state dependency, i.e. the key reference states (e.g. landing at t=130) predicted by the multi-head attention layer in Transformer for states (e.g. jumping at t=495, 500, 505). We observed that the learned reward function gives higher reward and attention to the agent’s jumping, as the rewards of both the jumping in the large hop (stage A) and the small hop (stage D) are relatively higher than that of landing phases (stages B, C and E).