Recent advances in deep reinforcement learning algorithms have shown greatpotential and success for solving many challenging real-world problems,including Go game and robotic applications. Usually, these algorithms need acarefully designed reward function to guide training in each time step.However, in real world, it is non-trivial to design such a reward function, andthe only signal available is usually obtained at the end of a trajectory, alsoknown as the episodic reward or return. In this work, we introduce a newalgorithm for temporal credit assignment, which learns to decompose theepisodic return back to each time-step in the trajectory using deep neuralnetworks. With this learned reward signal, the learning efficiency can besubstantially improved for episodic reinforcement learning. In particular, wefind that expressive language models such as the Transformer can be adopted forlearning the importance and the dependency of states in the trajectory,therefore providing high-quality and interpretable learned reward signals. Wehave performed extensive experiments on a set of MuJoCo continuous locomotivecontrol tasks with only episodic returns and demonstrated the effectiveness ofour algorithm.
Quick Read (beta)
Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning
Recent advances in deep reinforcement learning algorithms have shown great potential and success for solving many challenging real-world problems, including Go game and robotic applications. Usually, these algorithms need a carefully designed reward function to guide training in each time step. However, in real world, it is non-trivial to design such a reward function, and the only signal available is usually obtained at the end of a trajectory, also known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment, which learns to decompose the episodic return back to each time-step in the trajectory using deep neural networks. With this learned reward signal, the learning efficiency can be substantially improved for episodic reinforcement learning. In particular, we find that expressive language models such as the Transformer can be adopted for learning the importance and the dependency of states in the trajectory, therefore providing high-quality and interpretable learned reward signals. We have performed extensive experiments on a set of MuJoCo continuous locomotive control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.
Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning
Yang Liu UIUC Yunan Luo UIUC Yuanyi Zhong UIUC Xi Chen covariant.ai Qiang Liu UT Austin Jian Peng UIUC
noticebox[b]Preprint. Under review.\[email protected]
Deep reinforcement learning (RL) methods, including the well-known policy gradient algorithms [24, 31, 32] and deep Q-networks , have shown superior performance and great potential in many difficult real-world problems, such as the Go game [33, 34], locomotive continuous control problems , resource management , and robotics . The key idea of such algorithms is to use deep neural networks as functional approximators to abstract or represent complex state observation so that actions can be properly chosen accordingly to optimize a long-term expected return. The learned policy or Q function essentially captures the temporal structure of the sequential decision problem and decompose it to a supervised learning problem, guided by the reward signal. However, in many real-world problems, the reward signal is usually not dense enough to provide sufficient supervision for learning the decision at each single time step. In many practical tasks, such as the Go game and the automatic chemical design problems , we can only obtain a final reward or return value after finishing the entire rollout of the policy, while no intermediate reward is provided before reaching the end of the trajectory. This type of problems is also known as the episodic reinforcement learning.
Unfortunately, when the reward signal becomes delayed or even episodic, most existing deep reinforcement learning algorithms may get stuck during the training process and often suffer from inferior performance and inefficient sample complexity [8, 9]. This problem is widely known as the temporal credit assignment in reinforcement learning , which describes the issue of delayed rewards causing the signal to be diluted over time and only weakly affecting the states temporally distant from the time step when the rewards get collected. For example, in Go games, the only effective reward is the final win or loss. This reward is received only after finishing the entire games, usually consisting of hundreds of moves in the trajectory. During the game, though human professional players or experts may be able to decide which moves are likely to influence the final winning probability, quantitatively design of such informative rewards is very challenging. Such sparse or episodic reward signal makes the training of policies or Q-neural networks very difficult, as a lot of data is required to propagate the final win/loss reward back to earlier states or state-action pairs. In addition, since there is no immediate reward or even no short-term reward, the exploration becomes challenging without any information. Similar to Go games, there are many real-world problems in which rewards are terribly delayed or episodic, with only a non-zero reward value received at the end of the episode or the trajectory.
In this paper, we introduce a new algorithm for the temporal credit assignment problem. The idea is to learn deep neural networks that is able to decompose the episodic reward into parts and assign them back to each time step in the trajectory. After learning this decomposition, we can use the assigned dense reward signal to guide policy optimization. Formally, we derive a generalized policy gradient for decomposed reward signal. In our derivation, we find that to ensure the correctness of the policy gradient algorithms, the dependency of the neural network on time steps needs to be forward-looking, i.e. for a time before the terminal step , the reward function does not depend on any future states or actions. This forward structure, frequently seen in natural language processing , motivates us to apply the renowned neural-network sequence models, such as language models to learning the reward decomposition. In particular, we adopt the Transformer, which satisfies the forward-looking structure, for learning the importance and the dependency of states using the self-attention mechanism. With the learned reward signal, we show on a set of MuJoCo continuous locomotive control tasks with episodic returns that the learning and sample efficiency of can be greatly improved.
2.1 Background and Problem Statement
Reinforcement learning considers the problem of finding an optimal policy for an agent which interacts with an environment and collects reward per action. The goal of the agent is to maximize the cumulative reward along a trajectory. Formally, this problem can be formulated as a Markov decision process (MDP) over the environment states and agent actions , under an unknown environmental dynamics defined by a transition probability . The agent’s action is selected by a conditional probability distribution parameterized by . Playing the policy repeatedly under MDP yields a trajectory , where denotes the horizon length. Each trajectory is associated with a reward function which we want to optimize, that is, where the expectation is with the distribution of the unknown dynamics and policy .
In standard RL settings, it is common to assume that the reward is a discounted sum of a set of local reward functions distributed across the time, that is, where the local reward signal is assumed to be observed immediately following the action performed at state , and is the discount factor. This decomposition structure greatly simplifies the problem, and forms the basic assumption of popular policy gradient algorithms. The focus of this work, however, is the more difficult case when does not have a simple or known decomposition structure priori, and is observed only after the whole trajectory is rolled out.
Policy Gradient There are several different types of algorithms for learning policies, including Q-learning, policy gradient and evolutionary algorithms. For the general episodic reward function , we have a basic gradient estimation derived using the likelihood ratio trick,
This basic algorithm, however, often yields large variance in gradient estimation. The policy gradient theorem  allows us to derive a simplified formula for the case when the reward is decomposed step-wisely:
where denotes the expected return under policy starting from state and action . With empirical estimate of using rollout trajectories, then we can obtain the well-known REINFORCE policy gradient  as
Improved policy gradient methods, such as the Proximal Policy Optimization (PPO) [32, 12], are now able to provide the state-the-art performance on many problems. It uses a proximal Kullback-Leibler (KL) divergence penalty to regularize and stabilize the policy gradient. Furthermore, control variate methods, such as the generalized advantage estimation (GAE) , help reducing the variance of the policy gradient estimation. So far, most policy gradient methods are designated for infinite-horizon, dense reward settings, as the dense rewards can provide direct supervision for value function estimation and policy improvement in each time step.
Episodic RL For tasks with episodic rewards, usually, there is a set of terminal states. At the end of each trajectory , a reward is only received at the terminal state. In other words, before reaching the final state , rewards for all . In many tasks, the terminal states can be predefined, or the length of the trajectories is limited. For simplicity, we omit the discount factor and assume the trajectory length is at most . In this way, we can abuse the notation of to denote the last state without further confusion. Therefore, the objective of episodic reinforcement learning becomes Note that it is not hard to add the discount factor back and/or adopt the mathematical definitions and derivations for problems with a set of terminal states.
For episodic problems, the straightforward application of policy gradient methods, including REINFORCE , A2C  and PPO , may suffer from sample inefficiency, as the final episodic reward would only provide the same or similar supervision for learning policy over all time steps in a trajectory. Therefore, a huge volume of rollout trajectories are required to distinguish the subtle influence of certain action on the final reward. Besides policy optimization methods, blackbox optimization approaches , such as cross entropy method , CMA-ES  and evolution strategies , have been also applied to episodic RL problems due to their computational efficiency.
2.2 Overview of Our Approach
We propose a new approach to learn a dense surrogate reward function that approximates the temporal credit assignment of the episodic reward (Algorithm 2.2). The idea is intuitive: we hope to find approximating , so that we can use as the surrogate reward to compute the policy gradient. If is dense over time step and provide sufficient information about the influence on the episodic reward, then it should help improve the sample efficiency of training policies under the episodic reward setting.
[H] \[email protected]@algorithmic \STATEInitialize: policy parameters , predictor parameters \FOR = 1, 2, 3, … \STATECollect a batch of trajectories using roll-outs. \STATEAppend the new trajectories into trajectory buffer for regression. \STATETrain reward predictor using gradient descent: \STATEUpdate policy parameters using policy optimization algorithm: where is obtained with Eq. (7) . \ENDFOR\STATEOutput: policy , reward predictor
Here, we consider a generalization of the time step-wise reward function. Assume a reward function that is defined on states and actions over a time interval . Then we expect that the episodic reward can be decomposed as the sum of the reward function on all intervals: , where and . The choice of can be very flexible: If each interval only contains a single time step, then the reward function is defined on each time step ; If contains all consecutive sub-sequences starting from time , then each contains all time steps from the beginning of the trajectory to the current time step .
Therefore, the objective function of learning such reward can be done by minimizing the regression loss as following:
where can be a neural network that takes and as input and is parameterized by , is a collection of trajectories. There are several critical choices: first, how one should define the interval set ; second, which model would be used for the reward function; third, how to collect the dataset . We will discuss the design principles in the following sections and their impact in the experimental section.
2.3 Generalized Policy Gradient with Rewards on Time Intervals
Policy Gradient of Composite Reward
Now, let us assume we have learned a composite approximation of the reward function, , where is a local reward function that defined on state and actions over time interval . Our key idea is to leverage the decomposition structure of to simplify and reduce the variance of the policy gradient formula, as we summarize in the following generalization of policy gradient theorem to the composite rewards.
I) Denote by the expectation of the composite reward , we have
where and denotes the maximum element of set ; note that is the set of all that should multiply by .
II) Equivalently, we have
where is a generalized -function, defined to be
Here is the set of all whose upper bound excess .
Theorem 1 allows us to leverage decomposition structure of to increase the efficiency of policy gradient. Compared to the basic gradient formula in (1), Eq. (5) keeps only the in set for each local reward This is obtained by the fact of MDP that present actions only impact the future but not the past, similar to what we have seen in the original policy gradient.
Eq. (6) in Theorem 1 can be viewed as a generalization of policy gradient theorem in Eq. 2, and is similar to the definition of typical notion of -function, which corresponds to our special case when each includes an individual time steps, that is, .
By replacing the expectation in Eq. (6) with empirical average from rollout trajectories, we can derive a generalized policy gradient for the composite reward that is, given a set of trajectories , , we have
where again is the generalized function.
Bias Correction and Control Variates
The method above works well if forms an accurate approximation of . However, when the approximation is poor, it introduces a significant bias into the gradient estimation and hence deteriorates the performance. To address this problem, we propose to add the residual term to correct this bias. This yields a gradient estimation of form
where denotes the residual error, and is the gradient of composite reward in (6) and (7). This allows us to give an unbiased gradient estimation of the true expected reward , while being able to leverage the structure of the composite reward.
Theoretically, we can show that the residual corrected gradient estimation (7) can be viewed as a control variate, a widely used approach for reducing variance in policy optimization. In particular, we can show that (7) is equivalent to
where , which collects all the terms that do not appear in (5). We can show that for all , and hence subtracting in (8) does not change the expectation of the formula. Note that is a generalization of the standard baseline, which is typically taken to be a constant, or depend only on .
With this unbiased estimator, the policy gradient would be robust so that we no longer need to worry about whether the regression of is sufficient or the dataset is not well chosen. In practice, we find that both and works reasonably well if the regression loss is sufficiently optimized.
Note that if is the set of individual time steps and , Eq. (8) will be reduced to the classic policy gradient as shown in Eq. (2). It is probably more interesting to consider to go beyond the set of time steps to include more information. One critical observation from the above derivation is that the interval can be as large as possible if is upper bounded by time , if we want the supervision of for . Therefore, to include as much information as possible, it is natural to see that we can define as the set of all consecutive sub-sequences starting from time , so that each contains all time steps from the beginning of the trajectory to the current time step . This structure of is particularly interesting and highly resembles the forward-looking structure studied in sequence modeling in NLP, such as language modeling.
2.4 Learning Temporal Credit Assignment using Language Models
Motivated by the forward-looking structure, we propose to adopt neural-network language models for learning the reward function for credit assignment. In nature language processing, a language model assigns probability to a given sequence in a language . A more tangible and related model is to assign probability of an upcoming word given a sequence of prior words. More formally, a language model predicts the probability of word by parameterizing the conditional distribution . Such models would be very useful in many applications, especially those generating sequences as output. Notable examples including the n-gram models and the recurrent neural networks, which attempt to capture medium- to long-range dependencies in the sentence. Very recently, a model entirely based on attention mechanisms was proposed for language modeling and has achieved state-of-the-art performance on neural machine translation and other NLP tasks [40, 7]. The model, called Transformer, has an encoder-decoder structure and is composed of stacked self-attention and fully connected layers, without using any recurrence or convolution. To attend multiple parts of the input sequence simultaneously, instead of using a single large attention “head", the Transformer uses multiple small attention heads to project the input sequence into multiple subspaces and combines the attention outputs by concatenation. Note that in Transformer, the self-attention for constructing an latent vector at a word is based on all other words prior to the current one, which highly resembles the desideratum for the reward function, as discussed above. Furthermore, it is also possible to use recurrent neural networks or convolutional neural networks to model the reward function.
In this work, we consider the Transformer network model, because of its superior performance and interpretability. Our model for learning the reward function consists of two parts. The first part is an encoder module of the Transformer, which is composed of a multi-head attention layer followed by a position-wise fully-connected feed-forward layer. The second part is a self-attention layer , which generates a set of summation weight vectors for the encoder outputs. The set of summation weight vectors are used to multiply with the Transformer outputs, resulting in a hidden representation which is then processed by a regression layer to give the predicted rewards .
Specifically, suppose we have a trajectory that has state-action pairs, represented as . Here is a -dimensional observation vector and is a -dimensional action vector. is thus represented as a by matrix. Note that we omit the special terminal state here for notational simplicity, if not causing further confusion. Each state-action pair in the trajectory is then processed by a feed forward layer, whose parameters are shared across all time steps, to give a fixed-length vector representation for this state-action pair. To gain the dependency between the current time step and other time steps before , we use a encoder layer to process the state-action pairs: . Let the dimension of each start-action pair representation vector be . Since the Transformer network does not change the dimension of input vectors, we represent all the representation vectors as a by matrix . To summarize the information in , we apply a self-attention mechanism:
Here is a weight matrix with dimension by and is a vector of parameters with size , where is a hyper-parameter. Vector has size , and each entry ranges from 0 to 1, quantifying the importance of the state-action pair in predicting the reward for interval . We combine the hidden representations in using to obtain a summarized representation . To predict the reward , we add one regression layer parameterized by and and output the predicted reward as
Some other networks we consider include the feed-forward neural network and long short-term memory network (LSTM). Please see Figure A2 to see their differences.
In this section, we conduct experiments to provide evidences for the following questions? (1) Is the learned reward function useful for improving policy optimization in episodic RL? (2) What are the appropriate choices of interval set , neural network model, and dataset ? Note that we also provide a case study on visualization and intepretability of the learned reward function in Appendix.
3.1 Experimental Settings
The experiments are conducted on a set of high-dimensional locomotion tasks in continuous domain using OpenAI gym  and MuJoCo simulation toolkits. We use the PPO algorithm introduced in  for all the experiments. The policy is represented by a uni-mode Gaussian distribution with diagonal covariance. The mean is parameterized by a two-layer neural network with 64 hidden units and tanh non-linearity. The log standard deviation is parameterized by a global vector. The same architecture is applied for value function approximation. In practice, we implemented the policy gradient estimation shown in Eq. (7), where we decompose into two parts: one with the predicted reward, the other with only the residual. In this way, we can use advanced variance reduction approaches to improve the estimation of each components. In our experiments, we use the generalized advantage estimation (GAE)  as the control variate method for variance reduction. More hyper-parameters of PPO and GAE are tabulated in Table A1 in Appendix. We also compare against baseline algorithms trained with episodic reward, respectively. The episodic reward of each rollout trajectory is defined as the accumulated original reward at all time-steps. The experiments are run for 5M timesteps with 5 random seeds. For baseline algorithms, we also consider an LSTM policy with 128 hidden units besides the aforementioned MLP policy. All experiments are conducted on NVIDIA 1080 GPUs.
For the reward function in our experiments, we investigate three neural network architectures, Feed-Forward (FF) Network, LSTM , and Transformer, to identify the best network structure that captures temporal dependency (Figure A2). For observation and action at time , an FF network, whose parameter is shared across all time steps, is used to give the predicted reward for individual time steps. LSTM’s reward prediction is also conditioned on all previous timesteps, so is a function of trajectory segment . To perform reward regression based on the LSTM/FF outputs, we aggregate the the representations using a mean-pooling layer. The hyper-parameters of LSTM and FF are also listed in A1.
For the buffer updating schemes, we proposed the following three approaches to renew the trajectory buffer for reward predicting.
Online (O): The buffer is implemented as a FIFO queue with length (hyper-parameter). In each iteration, the new roll-out trajectories will be inserted into the queue and the old trajectories will be removed if the queue is too long.
Historical+Online (HO): Two queues are maintained: One queue is the same as the one in Online, storing the most recent rollouts by the current policy. The other queue stores the trajectories with the highest episodic returns from previous rollouts.
Stratified-Sampling (S): To balance the training with episodic return regression, the buffer stores a larger number of trajectories in the history. In each iteration, trajectories are sampled to ensure the episodic return is uniformly in the five bins. The queue size and sample queue size are hyper-parameters.
3.2 Credit Assignment Enables Policy Optimization with Episodic Reward
Our experiments with the baseline algorithms on the MuJoCo control suite demonstrated that learning with episodic return is extremely hard both with MLP policy and LSTM policy. In all the environments we tested (Figure 1), it is not surprising that the baseline method PPO (episodic) with episodic reward performs much worse than previously reported in other papers with dense reward . In most of the runs, the policy cannot make any improvement after the initial time steps.
On the other hand, our proposed credit-assignment algorithm with learned return decomposition consistently achieves better performance than the episodic return baselines across all environments (Figure 1). Here we use the Transformer network for decomposing rewards and MLP for representing policy, and the HO strategy to collect trajectories for optimization. In many experiments, the policies learned by our method are able to achieve the quite reasonably good performance when the original dense reward is used for training. In environments such as the Humanoid, the Hopper and the Swimmer, with the appropriate hyper-parameters, our methods can obtain comparable performance to the policies trained with the original dense rewards, outperforming the episodic return baseline by a large margin. In addition, it also outperforms the cross-entropy method (CEM) which is suitable for the episodic RL setting (Table 1). These results indicate that the our reward decomposition framework can successfully enable stable learning and greatly improve the sample efficiency in episodic settings.
3.3 Ablation Analysis
Here we perform comparisons to check the choices of 1) three network structures for reward function; 2) strategies for data collection. We also check the utility of bias correction when the reward regression is not well fitted. We use the environment Walker2d to perform these analyses and demonstrate the results in Figure 2.
As we can see, all the three networks structures outperform the episodic baselines, and the Transformer network provides the best performance. We conjecture that the reason why Transformer performs better than LSTM is that it is easier to train, as also implied by existing NLP literature . Then we check whether the data collection strategy has any impact on the performance. All three strategies seem to work reasonably well, while HO appears to be the best which indicating that the historical high-quality trajectories helps reward learning. Finally, we compare the performance of and under different learning rates ( and ) for the regression loss. We use different learning rates essentially to adjust the quality of the reward fitting. It is easy to see that with the bias correction, the learning of is more robust, indicating that bias correction is needed. Extra results on ablation analysis can be found in the appendix (Figures A3, A4).
4 Related Work
The optimal reward design problem [35, 36] concerns about finding the proxy reward function can obtains high expected return according to the true reward function. Inverse reward design  studies the opposite problem of inferring true reward from designed reward. Intrinsic motivation has been shown to improve sample efficiency of RL algorithms, for example by using information gain , pseudo count  or prediction error [37, 28] as an intrinsic bonus reward to aid exploration.  studies intrinsic motivation under the optimal reward design framework where the optimal intrinsic reward is learned through gradient descent. Reward shaping  explores the space of reward function modifications (specifically potential-based rewards) which do not change the corresponding optimal policy. Hindsight Experience Replay  adds additional goals and corresponding rewards to a Q-learning algorithm. Meta-learning can also be utilized to learn different objective functions for different goals . Our approach is more general as we don’t assume specific goals of the agent. When only expert demonstrations are available as in imitation learning, inverse reinforcement learning can be used to recover the reward function from expert trajectories . As an alternative approach for solving sparse reward problems, the auxiliary-task approaches  focus on improving the representation by adding extra self-supervised losses. However, our proposed method learns to decompose the episodic return as reward for policy optimization directly.
Recently, several concurrent works have explored the same direction of decomposing the episodic return. Sparse Attentive Backtracking applies the attentive mechanism for network straining, while ours focuses on learning the decomposition of the reward signal. Temporal Value Transport (TVT)  relies on a memory reconstitutive module to retrieve past data for the visual RL problems, however, ours aims at decomposing the episodic return for general problems. As one most related work, RUDDER  could be viewed as a special case of our work in which consecutive time-steps were used. On the other hand, our work provides a general framework with interval rewards which justifies the applicability of language models and the correctness of generalized policy gradient beyond the simple Markovian assumption.
We presented a new algorithm for learning temporal credit assignment, which uses deep neural network (Transformer) to decompose the episodic reward back to each time step in the trajectory. The assigned dense reward signals obtained from the decomposition are then used to guide training algorithms of policies. We demonstrated that our credit assignment algorithm substantially improved the learning and sample efficiency on a set of MuJoCo continuous locomotive control tasks. The reward function learned by our algorithm can also be interpreted by an attention mechanism, potentially providing insights on identifying key state-action pairs that contribute to successful reinforcement learning.
-  Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
-  Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
-  Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
-  Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
-  Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning self-imitating diverse policies. Proceedings of the International Conference on Learning Representations (ICLR), 2019.
-  Yijie Guo, Junhyuk Oh, Satinder Singh, and Honglak Lee. Generative adversarial self-imitation learning. arXiv preprint arXiv:1812.00950, 2018.
-  Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765–6774, 2017.
-  Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
-  Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.
-  Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
-  Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. arXiv preprint arXiv:1810.06721, 2018.
-  Max Jaderberg, Volodymyr Mnih, Wojciech Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2017.
-  Daniel Jurafsky and James H. Martin. Speech and Language Processing (2Nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009.
-  Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C. Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, pages 7651–7662, USA, 2018. Curran Associates Inc.
-  Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
-  Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
-  Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
-  Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pages 50–56. ACM, 2016.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
-  Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
-  Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1):48, 2017.
-  Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
-  Reuven Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1(2):127–190, 1999.
-  Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
-  John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. Proceedings of the International Conference on Learning Representations (ICLR), 2016.
-  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
-  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
-  Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from. In Proceedings of the annual conference of the cognitive science society, pages 2601–2606, 2009.
-  Jonathan Sorg, Richard L Lewis, and Satinder P Singh. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pages 2190–2198, 2010.
-  Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
-  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
-  Richard Stuart Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 1984. AAI8410337.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
-  Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4649–4659, 2018.
Appendix A Method Overview and Network structures
Appendix B Proof of Theorem 1
I) Recall the standard likelihood ratio gradient formula:
On the other hand, note that for any , we have
Appendix C Hyper-parameters
|Policy Network||MLP with shape (64, 64)|
|PPO Batch Size||2048|
|PPO Mini-Batch Size||64|
|PPO number of epoch per iteration||5|
|PPO learning rate||0.0001|
|PPO clip range||0.2|
|Reward predictor learning rate|
|Transformer number of heads||4|
|Transformer layer size||64|
|Transformer hidden layer size||128|
|Transformer query/key size||32|
|LSTM hidden size||96|
Appendix D More Results on Ablation Analysis
Appendix E Interpretability of the Learned Reward Function
We visualize the learned agent and reward function to demonstrate that the reward predictor could attain knowledge from interacting with environment. Key state action pairs that contribute to successful reinforcement learning can also identified from the visualization . Using the Hopper environment as an example, we visualized in Figure A5 the learned temporal attentions for 1000 time steps, extracted from the last layer of the Transformer network. The first stage A corresponds to the agent starting a large jump, followed by the landing (stage B) and laying on the ground (stage C). The agent then makes a smaller hop in stage D and lands in stage E. We observed that the temporal attentions exhibit periodic behavior that coincides with the agent’s periodic motion. In the Hopper environment, the goal is to make the 2D one-legged robot move forward as fast as possible. A successfully trained agent essentially learns to hop forward periodically. Hence it is reasonable that the temporal attention and predicted reward show periodicity as the hopper jumps. Similar periodic behaviors of moving and adjusting balance were also be observed in other locomotion environments such as Humanoid.