Abstract
Recent advances in deep reinforcement learning algorithms have shown greatpotential and success for solving many challenging realworld problems,including Go game and robotic applications. Usually, these algorithms need acarefully designed reward function to guide training in each time step.However, in real world, it is nontrivial to design such a reward function, andthe only signal available is usually obtained at the end of a trajectory, alsoknown as the episodic reward or return. In this work, we introduce a newalgorithm for temporal credit assignment, which learns to decompose theepisodic return back to each timestep in the trajectory using deep neuralnetworks. With this learned reward signal, the learning efficiency can besubstantially improved for episodic reinforcement learning. In particular, wefind that expressive language models such as the Transformer can be adopted forlearning the importance and the dependency of states in the trajectory,therefore providing highquality and interpretable learned reward signals. Wehave performed extensive experiments on a set of MuJoCo continuous locomotivecontrol tasks with only episodic returns and demonstrated the effectiveness ofour algorithm.
Quick Read (beta)
Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning
Abstract
Recent advances in deep reinforcement learning algorithms have shown great potential and success for solving many challenging realworld problems, including Go game and robotic applications. Usually, these algorithms need a carefully designed reward function to guide training in each time step. However, in real world, it is nontrivial to design such a reward function, and the only signal available is usually obtained at the end of a trajectory, also known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment, which learns to decompose the episodic return back to each timestep in the trajectory using deep neural networks. With this learned reward signal, the learning efficiency can be substantially improved for episodic reinforcement learning. In particular, we find that expressive language models such as the Transformer can be adopted for learning the importance and the dependency of states in the trajectory, therefore providing highquality and interpretable learned reward signals. We have performed extensive experiments on a set of MuJoCo continuous locomotive control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.
Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning
Yang Liu UIUC Yunan Luo UIUC Yuanyi Zhong UIUC Xi Chen covariant.ai Qiang Liu UT Austin Jian Peng UIUC
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Deep reinforcement learning (RL) methods, including the wellknown policy gradient algorithms [24, 31, 32] and deep Qnetworks [25], have shown superior performance and great potential in many difficult realworld problems, such as the Go game [33, 34], locomotive continuous control problems [21], resource management [23], and robotics [20]. The key idea of such algorithms is to use deep neural networks as functional approximators to abstract or represent complex state observation so that actions can be properly chosen accordingly to optimize a longterm expected return. The learned policy or Q function essentially captures the temporal structure of the sequential decision problem and decompose it to a supervised learning problem, guided by the reward signal. However, in many realworld problems, the reward signal is usually not dense enough to provide sufficient supervision for learning the decision at each single time step. In many practical tasks, such as the Go game and the automatic chemical design problems [27], we can only obtain a final reward or return value after finishing the entire rollout of the policy, while no intermediate reward is provided before reaching the end of the trajectory. This type of problems is also known as the episodic reinforcement learning.
Unfortunately, when the reward signal becomes delayed or even episodic, most existing deep reinforcement learning algorithms may get stuck during the training process and often suffer from inferior performance and inefficient sample complexity [8, 9]. This problem is widely known as the temporal credit assignment in reinforcement learning [39], which describes the issue of delayed rewards causing the signal to be diluted over time and only weakly affecting the states temporally distant from the time step when the rewards get collected. For example, in Go games, the only effective reward is the final win or loss. This reward is received only after finishing the entire games, usually consisting of hundreds of moves in the trajectory. During the game, though human professional players or experts may be able to decide which moves are likely to influence the final winning probability, quantitatively design of such informative rewards is very challenging. Such sparse or episodic reward signal makes the training of policies or Qneural networks very difficult, as a lot of data is required to propagate the final win/loss reward back to earlier states or stateaction pairs. In addition, since there is no immediate reward or even no shortterm reward, the exploration becomes challenging without any information. Similar to Go games, there are many realworld problems in which rewards are terribly delayed or episodic, with only a nonzero reward value received at the end of the episode or the trajectory.
In this paper, we introduce a new algorithm for the temporal credit assignment problem. The idea is to learn deep neural networks that is able to decompose the episodic reward into parts and assign them back to each time step in the trajectory. After learning this decomposition, we can use the assigned dense reward signal to guide policy optimization. Formally, we derive a generalized policy gradient for decomposed reward signal. In our derivation, we find that to ensure the correctness of the policy gradient algorithms, the dependency of the neural network on time steps needs to be forwardlooking, i.e. for a time $t$ before the terminal step $T$, the reward function ${r}_{t}({\{{s}_{{t}^{\prime}},{a}_{t}^{\prime}\}}_{{t}^{\prime}=0}^{T})={r}_{t}({\{{s}_{{t}^{\prime}},{a}_{{t}^{\prime}}\}}_{{t}^{\prime}=0}^{t})$ does not depend on any future states or actions. This forward structure, frequently seen in natural language processing [18], motivates us to apply the renowned neuralnetwork sequence models, such as language models to learning the reward decomposition. In particular, we adopt the Transformer, which satisfies the forwardlooking structure, for learning the importance and the dependency of states using the selfattention mechanism. With the learned reward signal, we show on a set of MuJoCo continuous locomotive control tasks with episodic returns that the learning and sample efficiency of can be greatly improved.
2 Method
2.1 Background and Problem Statement
Reinforcement learning
Reinforcement learning considers the problem of finding an optimal policy for an agent which interacts with an environment and collects reward per action. The goal of the agent is to maximize the cumulative reward along a trajectory. Formally, this problem can be formulated as a Markov decision process (MDP) over the environment states $s\in S$ and agent actions $a\in A$, under an unknown environmental dynamics defined by a transition probability $T({s}^{\prime}s,a)$. The agent’s action ${a}$ is selected by a conditional probability distribution ${{\pi}}_{{\theta}}{}{(}{a}{}{s}{)}$ parameterized by ${\theta}{\in}{\mathrm{\Theta}}$. Playing the policy repeatedly under MDP yields a trajectory ${\tau}{=}{{\{}{{s}}_{{t}}{,}{{a}}_{{t}}{\}}}_{{t}{=}{0}}^{{T}}$, where ${T}$ denotes the horizon length. Each trajectory ${\tau}$ is associated with a reward function ${R}{}{(}{\tau}{)}$ which we want to optimize, that is, ${{\mathrm{max}}}_{{\theta}}{}{J}{}{(}{\theta}{)}{:=}{{\mathbb{E}}}_{{{\pi}}_{{\theta}}}{}{\left[}{R}{}{(}{\tau}{)}{\right]}{,}$ where the expectation ${{\mathbb{E}}}_{{{\pi}}_{{\theta}}}$ is with the distribution of the unknown dynamics ${T}{}{(}{{s}}_{{t}{+}{1}}{}{{s}}_{{t}}{,}{{a}}_{{t}}{)}$ and policy ${{\pi}}_{{\theta}}{}{(}{{a}}_{{t}}{}{{s}}_{{t}}{)}$.
In standard RL settings, it is common to assume that the reward is a discounted sum of a set of local reward functions distributed across the time, that is, $R(\tau )={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{r}_{t}({s}_{t},{a}_{t}),$ where the local reward signal $r(s,a)$ is assumed to be observed immediately following the action $a$ performed at state $s$, and $\gamma \in [0,1)$ is the discount factor. This decomposition structure greatly simplifies the problem, and forms the basic assumption of popular policy gradient algorithms. The focus of this work, however, is the more difficult case when $R(\tau )$ does not have a simple or known decomposition structure priori, and is observed only after the whole trajectory $\tau $ is rolled out.
Policy Gradient There are several different types of algorithms for learning policies, including Qlearning, policy gradient and evolutionary algorithms. For the general episodic reward function $R(\tau )$, we have a basic gradient estimation derived using the likelihood ratio trick,
$\begin{array}{cc}\hfill {\nabla}_{\theta}J(\theta )& ={\nabla}_{\theta}{\mathbb{E}}_{{\pi}_{\theta}}[R(\tau )]={\mathbb{E}}_{{\pi}_{\theta}}\left[R(\tau ){\displaystyle \sum _{t=0}^{T}}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right].\hfill \end{array}$  (1) 
This basic algorithm, however, often yields large variance in gradient estimation. The policy gradient theorem [38] allows us to derive a simplified formula for the case when the reward is decomposed stepwisely:
$${\nabla}_{\theta}J(\theta )={\mathbb{E}}_{\pi}\left[{\nabla}_{\theta}\mathrm{log}\pi (as){Q}^{\pi}(s,a)\right],$$  (2) 
where ${Q}^{\pi}(s,a)={\mathbb{E}}_{\pi}[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}r({s}_{t},{a}_{t}){s}_{0}=s,{a}_{0}=a]$ denotes the expected return under policy $\pi $ starting from state $s$ and action $a$. With empirical estimate of ${\widehat{{Q}}}^{{\pi}}{}{(}{{s}}_{{t}}{,}{{a}}_{{t}}{)}{=}{{\sum}}_{{j}{\ge}{t}}{{\gamma}}^{{j}{}{t}}{}{{r}}_{{j}}$ using rollout trajectories, then we can obtain the wellknown REINFORCE policy gradient [41] as
$${\widehat{\nabla}}_{\theta}J(\theta )=\frac{1}{T}\sum _{t=0}^{T}{\gamma}^{t}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t}){\widehat{Q}}^{\pi}({s}_{t},{a}_{t}).$$  (3) 
Improved policy gradient methods, such as the Proximal Policy Optimization (PPO) [32, 12], are now able to provide the statetheart performance on many problems. It uses a proximal KullbackLeibler (KL) divergence penalty to regularize and stabilize the policy gradient. Furthermore, control variate methods, such as the generalized advantage estimation (GAE) [31], help reducing the variance of the policy gradient estimation. So far, most policy gradient methods are designated for infinitehorizon, dense reward settings, as the dense rewards can provide direct supervision for value function estimation and policy improvement in each time step.
Episodic RL For tasks with episodic rewards, usually, there is a set of terminal states. At the end of each trajectory ${\tau}$, a reward is only received at the terminal state. In other words, before reaching the final state ${{s}}_{{T}}$, rewards ${{r}}_{{t}}{\mathrm{}}{\mathrm{(}}{{s}}_{{t}}{\mathrm{,}}{{a}}_{{t}}{\mathrm{)}}{\mathrm{=}}{\mathrm{0}}$ for all $$. In many tasks, the terminal states can be predefined, or the length of the trajectories is limited. For simplicity, we omit the discount factor and assume the trajectory length is at most ${T}$. In this way, we can abuse the notation of ${{s}}_{{T}}$ to denote the last state without further confusion. Therefore, the objective of episodic reinforcement learning becomes ${J}{\mathrm{}}{\mathrm{(}}{\theta}{\mathrm{)}}{\mathrm{=}}{{\mathbb{E}}}_{{{\pi}}_{{\theta}}}{\mathrm{}}{\mathrm{\left[}}{R}{\mathrm{}}{\mathrm{(}}{\tau}{\mathrm{)}}{\mathrm{\right]}}{\mathrm{=}}{{\mathbb{E}}}_{{{\pi}}_{{\theta}}}{\mathrm{}}{\mathrm{\left[}}{{r}}_{{T}}{\mathrm{}}{\mathrm{(}}{{s}}_{{T}}{\mathrm{)}}{\mathrm{\right]}}{\mathrm{.}}$ Note that it is not hard to add the discount factor back and/or adopt the mathematical definitions and derivations for problems with a set of terminal states.
For episodic problems, the straightforward application of policy gradient methods, including REINFORCE [41], A2C [24] and PPO [32], may suffer from sample inefficiency, as the final episodic reward would only provide the same or similar supervision for learning policy over all time steps in a trajectory. Therefore, a huge volume of rollout trajectories are required to distinguish the subtle influence of certain action on the final reward. Besides policy optimization methods, blackbox optimization approaches , such as cross entropy method [29], CMAES [11] and evolution strategies [30], have been also applied to episodic RL problems due to their computational efficiency.
2.2 Overview of Our Approach
We propose a new approach to learn a dense surrogate reward function that approximates the temporal credit assignment of the episodic reward (Algorithm 2.2). The idea is intuitive: we hope to find $\widehat{r}({s}_{t},{a}_{t})$ approximating $R(\tau )={\sum}_{t=0}^{T}\widehat{r}({s}_{t},{a}_{t})$, so that we can use $\widehat{r}$ as the surrogate reward to compute the policy gradient. If $\widehat{r}$ is dense over time step and provide sufficient information about the influence on the episodic reward, then it should help improve the sample efficiency of training policies under the episodic reward setting.
[H] \[email protected]@algorithmic[1] \STATEInitialize: policy parameters ${\theta}_{\mathrm{0}}$, predictor parameters ${\varphi}_{\mathrm{0}}$ \FOR$i$ = 1, 2, 3, … $N$ \STATECollect a batch of trajectories using rollouts. \STATEAppend the new trajectories into trajectory buffer for regression. \STATETrain reward predictor using gradient descent: ${\varphi}_{i}\leftarrow {\varphi}_{i1}{\gamma}_{\varphi}{\nabla}_{\varphi}{\mathcal{L}}_{regression}$ \STATEUpdate policy parameters using policy optimization algorithm: ${\theta}_{i}\leftarrow {\theta}_{i1}+{\gamma}_{\theta}\nabla J(\theta )$ where $\nabla J(\theta )$ is obtained with Eq. (7) . \ENDFOR\STATEOutput: policy ${\pi}_{{\theta}_{N}}$, reward predictor ${\varphi}_{N}$
Here, we consider a generalization of the time stepwise reward function. Assume a reward function $\widehat{r}$ that is defined on states and actions over a time interval $\alpha \in \mathcal{I}$. Then we expect that the episodic reward $R(\tau )={r}_{T}({s}_{T})$ can be decomposed as the sum of the reward function on all intervals: ${\sum}_{\alpha \in \mathcal{I}}\widehat{r}({s}_{\alpha},{a}_{\alpha})$, where ${s}_{\alpha}=\{{s}_{i}i\in \alpha \}$ and ${a}_{\alpha}=\{{a}_{i}i\in \alpha \}$. The choice of $\mathcal{I}$ can be very flexible: If each interval only contains a single time step, then the reward function is defined on each time step $\alpha \in \{\{0\},\{1\},\mathrm{\dots},\{T\}\}$; If $\mathcal{I}$ contains all consecutive subsequences starting from time $0$, then each $\alpha \in \{\{0,1,2,\mathrm{\dots},t\}:\forall t=0,\mathrm{\dots},T\}$ contains all time steps from the beginning of the trajectory to the current time step $t$.
Therefore, the objective function of learning such reward $\widehat{r}$ can be done by minimizing the regression loss as following:
$$\underset{\varphi}{\mathrm{min}}{\mathcal{L}}_{regression}(\varphi ):=\sum _{\tau \in D}{(\sum _{\alpha \in \mathcal{I}}{\widehat{r}}_{\varphi}({s}_{\alpha},{a}_{\alpha})R(\tau ))}^{2}$$  (4) 
where ${\widehat{r}}_{\varphi}$ can be a neural network that takes ${s}_{\alpha}$ and ${a}_{\alpha}$ as input and is parameterized by $\varphi $, $D$ is a collection of trajectories. There are several critical choices: first, how one should define the interval set $\mathcal{I}$; second, which model would be used for the reward function; third, how to collect the dataset $D$. We will discuss the design principles in the following sections and their impact in the experimental section.
2.3 Generalized Policy Gradient with Rewards on Time Intervals
Policy Gradient of Composite Reward
Now, let us assume we have learned a composite approximation of the reward function, $\widehat{R}(\tau )={\sum}_{\alpha \in \mathcal{I}}\widehat{r}({s}_{\alpha},{a}_{\alpha})$, where $\widehat{r}({s}_{\alpha},{a}_{\alpha})$ is a local reward function that defined on state and actions over time interval $\alpha \in \mathcal{I}$. Our key idea is to leverage the decomposition structure of $\widehat{R}$ to simplify and reduce the variance of the policy gradient formula, as we summarize in the following generalization of policy gradient theorem to the composite rewards.
Theorem 1
I) Denote by $\widehat{J}\mathit{}\mathrm{(}\theta \mathrm{)}\mathrm{:=}{\mathrm{E}}_{{\pi}_{\theta}}\mathit{}\mathrm{[}\widehat{R}\mathit{}\mathrm{(}\tau \mathrm{)}\mathrm{]}$ the expectation of the composite reward $\widehat{R}\mathit{}\mathrm{(}\tau \mathrm{)}$, we have
$\nabla \widehat{J}(\theta )$  $={\nabla}_{\theta}{\mathbb{E}}_{{\pi}_{\theta}}[\widehat{R}(\tau )]={\displaystyle \sum _{\alpha \in \mathcal{I}}}{\mathbb{E}}_{{\pi}_{\theta}}\left[\widehat{r}({s}_{\alpha},{a}_{\alpha}){\displaystyle \sum _{t\in {\mathrm{\Gamma}}_{\alpha}}}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right],$  (5) 
where ${\mathrm{\Gamma}}_{\alpha}\mathrm{=}\mathrm{\{}t\mathrm{:}t\mathrm{\le}\mathrm{max}\mathit{}\mathrm{(}\alpha \mathrm{)}\mathrm{\}}$ and $\mathrm{max}\mathit{}\mathrm{(}\alpha \mathrm{)}$ denotes the maximum element of set $\alpha $; note that ${\mathrm{\Gamma}}_{\alpha}$ is the set of all $t$ that ${\mathrm{\nabla}}_{\theta}\mathit{}\mathrm{log}\mathit{}\pi \mathit{}\mathrm{(}{a}_{t}\mathrm{}{s}_{t}\mathrm{)}$ should multiply by $\widehat{r}\mathit{}\mathrm{(}{s}_{\alpha}\mathrm{,}{t}_{\alpha}\mathrm{)}$.
II) Equivalently, we have
$$\nabla \widehat{J}(\theta )={\mathbb{E}}_{{\pi}_{\theta}}\left[\sum _{t=0}^{T}{Q}_{t}(\tau ){\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right],$$  (6) 
where ${Q}_{t}$ is a generalized $Q$function, defined to be
$${Q}_{t}(\tau )=\sum _{\alpha \in {\mathrm{\Gamma}}_{t}^{*}}\widehat{r}({s}_{\alpha},{a}_{\alpha}),{\mathrm{\Gamma}}_{t}^{*}=\{\alpha :\mathrm{max}(\alpha )\ge t\}.$$ 
Here ${\mathrm{\Gamma}}_{t}^{\mathrm{*}}$ is the set of all $\alpha $ whose upper bound $\mathrm{max}\mathit{}\mathrm{(}\alpha \mathrm{)}$ excess $t$.
Theorem 1 allows us to leverage decomposition structure of $\widehat{R}(\tau )$ to increase the efficiency of policy gradient. Compared to the basic gradient formula in (1), Eq. (5) keeps only the $\nabla \mathrm{log}\pi ({a}_{t}{s}_{t})$ in set ${\mathrm{\Gamma}}_{\alpha}$ for each local reward $\widehat{r}({s}_{\alpha},{a}_{\alpha}).$ This is obtained by the fact of MDP that present actions only impact the future but not the past, similar to what we have seen in the original policy gradient.
Eq. (6) in Theorem 1 can be viewed as a generalization of policy gradient theorem in Eq. 2, and ${Q}_{t}$ is similar to the definition of typical notion of $Q$function, which corresponds to our special case when each $\alpha \in \mathcal{I}$ includes an individual time steps, that is, $\mathcal{I}=\{\{0\},\{1\},\mathrm{\dots},\{T\}\}$.
By replacing the expectation in Eq. (6) with empirical average from rollout trajectories, we can derive a generalized policy gradient for the composite reward $\widehat{R}(\tau ),$ that is, given a set of trajectories ${\tau}^{i}=\{{({s}_{t}^{i},{a}_{t}^{i})}_{t=0}^{T}\}$, $i=1,\mathrm{\dots},n$, we have
$${\nabla}_{\theta}\widehat{J}(\theta )=\frac{1}{n}\sum _{i=1}^{n}\sum _{t=0}^{T}{Q}_{t}({\tau}^{i})\nabla \mathrm{log}\pi ({a}_{t}^{i}{s}_{t}^{i}),$$ 
where again ${Q}_{t}({\tau}^{i})={\sum}_{\alpha \in {\mathrm{\Gamma}}_{t}^{*}}\widehat{r}({s}_{\alpha}^{i},{a}_{\alpha}^{i})$ is the generalized $Q$ function.
Bias Correction and Control Variates
The method above works well if $\widehat{R}(\tau )$ forms an accurate approximation of $R(\tau )$. However, when the approximation is poor, it introduces a significant bias into the gradient estimation and hence deteriorates the performance. To address this problem, we propose to add the residual term to correct this bias. This yields a gradient estimation of form
$\nabla J(\theta )={\mathbb{E}}_{{\pi}_{\theta}}\left[{r}_{0}(\tau ){\displaystyle \sum _{t=0}^{T}}\nabla \mathrm{log}\pi ({a}_{t}{s}_{t})\right]+\nabla \widehat{J}(\theta ).$  (7) 
where ${r}_{0}(\tau )=R(\tau )\widehat{R}(\tau )$ denotes the residual error, and $\nabla \widehat{J}(\theta )$ is the gradient of composite reward $\widehat{J}(\theta )={\mathbb{E}}_{{\pi}_{\theta}}[\widehat{R}(\tau )]$ in (6) and (7). This allows us to give an unbiased gradient estimation of the true expected reward $J(\theta )$, while being able to leverage the structure of the composite reward.
Theoretically, we can show that the residual corrected gradient estimation (7) can be viewed as a control variate, a widely used approach for reducing variance in policy optimization. In particular, we can show that (7) is equivalent to
$$\nabla J(\theta )={\mathbb{E}}_{{\pi}_{\theta}}\left[\sum _{t=0}^{T}\left(R(\tau ){\widehat{r}}_{\mathrm{\neg}t}(\tau )\right){\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right]$$  (8) 
where ${\widehat{r}}_{\mathrm{\neg}t}(\tau )={\sum}_{\alpha \notin {\mathrm{\Gamma}}_{t}^{*}}\widehat{r}({s}_{\alpha},{a}_{\alpha})$, which collects all the terms that do not appear in (5). We can show that ${\mathbb{E}}_{{\pi}_{\theta}}[{\widehat{r}}_{\mathrm{\neg}t}(\tau ){\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})]=0$ for all $t$, and hence subtracting ${r}_{\mathrm{\neg}t}$ in (8) does not change the expectation of the formula. Note that ${r}_{\mathrm{\neg}t}$ is a generalization of the standard baseline, which is typically taken to be a constant, or depend only on ${s}_{t}$.
With this unbiased estimator, the policy gradient would be robust so that we no longer need to worry about whether the regression of $\widehat{r}$ is sufficient or the dataset $D$ is not well chosen. In practice, we find that both ${\nabla}_{\theta}\widehat{J}(\theta )$ and ${\nabla}_{\theta}J(\theta )$ works reasonably well if the regression loss is sufficiently optimized.
Discussion.
Note that if $\mathcal{I}$ is the set of individual time steps and $R(\tau )={\sum}_{t}\widehat{r}({s}_{t},{a}_{t})$, Eq. (8) will be reduced to the classic policy gradient as shown in Eq. (2). It is probably more interesting to consider $\mathcal{I}$ to go beyond the set of time steps to include more information. One critical observation from the above derivation is that the interval $\alpha $ can be as large as possible if $\mathrm{max}(\alpha )$ is upper bounded by time $t$, if we want the supervision of $\widehat{r}({s}_{\alpha},{a}_{\alpha})$ for ${\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}({s}_{t},{a}_{t})$. Therefore, to include as much information as possible, it is natural to see that we can define $\mathcal{I}$ as the set of all consecutive subsequences starting from time $0$, so that each $\alpha \in \{\{0,1,2,\mathrm{\dots},t\}:\forall t=0,\mathrm{\dots},T\}$ contains all time steps from the beginning of the trajectory to the current time step $t$. This structure of $\mathcal{I}$ is particularly interesting and highly resembles the forwardlooking structure studied in sequence modeling in NLP, such as language modeling.
2.4 Learning Temporal Credit Assignment using Language Models
Motivated by the forwardlooking structure, we propose to adopt neuralnetwork language models for learning the reward function $\widehat{r}$ for credit assignment. In nature language processing, a language model assigns probability to a given sequence in a language [5]. A more tangible and related model is to assign probability of an upcoming word given a sequence of prior words. More formally, a language model predicts the probability of word ${{w}}_{{t}}$ by parameterizing the conditional distribution ${p}{}{(}{{w}}_{{t}}{}{{w}}_{{1}}{,}{{w}}_{{2}}{,}{\mathrm{\dots}}{,}{{w}}_{{t}{}{1}}{)}$. Such models would be very useful in many applications, especially those generating sequences as output. Notable examples including the ngram models and the recurrent neural networks, which attempt to capture medium to longrange dependencies in the sentence. Very recently, a model entirely based on attention mechanisms was proposed for language modeling and has achieved stateoftheart performance on neural machine translation and other NLP tasks [40, 7]. The model, called Transformer, has an encoderdecoder structure and is composed of stacked selfattention and fully connected layers, without using any recurrence or convolution. To attend multiple parts of the input sequence simultaneously, instead of using a single large attention “head", the Transformer uses multiple small attention heads to project the input sequence into multiple subspaces and combines the attention outputs by concatenation. Note that in Transformer, the selfattention for constructing an latent vector at a word is based on all other words prior to the current one, which highly resembles the desideratum for the reward function, as discussed above. Furthermore, it is also possible to use recurrent neural networks or convolutional neural networks to model the reward function.
In this work, we consider the Transformer network model, because of its superior performance and interpretability. Our model for learning the reward function $\widehat{r}$ consists of two parts. The first part is an encoder module of the Transformer, which is composed of a multihead attention layer followed by a positionwise fullyconnected feedforward layer. The second part is a selfattention layer [22], which generates a set of summation weight vectors for the encoder outputs. The set of summation weight vectors are used to multiply with the Transformer outputs, resulting in a hidden representation which is then processed by a regression layer to give the predicted rewards $\widehat{r}$.
Specifically, suppose we have a trajectory that has $n$ stateaction pairs, represented as $\tau =\{{({s}_{t},{a}_{t})}_{t=0}^{T1}\}$. Here ${s}_{t}$ is a ${d}_{s}$dimensional observation vector and ${a}_{t}$ is a ${d}_{a}$dimensional action vector. $\tau $ is thus represented as a $T$ by $({d}_{s}+{d}_{a})$ matrix. Note that we omit the special terminal state ${s}_{T}$ here for notational simplicity, if not causing further confusion. Each stateaction pair $({s}_{t},{a}_{t})$ in the trajectory is then processed by a feed forward layer, whose parameters are shared across all time steps, to give a fixedlength vector representation ${\mathbf{v}}_{t}$ for this stateaction pair. To gain the dependency between the current time step $t$ and other time steps before $t$, we use a encoder layer to process the stateaction pairs: ${\mathbf{h}}_{t}=\text{Transformer}({\mathbf{v}}_{0},{\mathbf{v}}_{1},\mathrm{\dots},{\mathbf{v}}_{t})$. Let the dimension of each startaction pair representation vector ${\mathbf{v}}_{t}$ be $d$. Since the Transformer network does not change the dimension of input vectors, we represent all the $T$ representation vectors ${\mathbf{h}}_{t}$ as a $T$ by $d$ matrix $H=({\mathbf{h}}_{0},{\mathbf{h}}_{1},\mathrm{\dots},{\mathbf{h}}_{T1})$. To summarize the information in $H$, we apply a selfattention mechanism:
$$\mathbf{z}=\text{sigmoid}{({\mathbf{w}}_{s2}\mathrm{tanh}({W}_{s1}{H}^{\top}))}^{\top}.$$ 
Here ${W}_{s1}$ is a weight matrix with dimension ${d}_{z}$ by $d$ and ${\mathbf{w}}_{s2}$ is a vector of parameters with size ${d}_{z}$, where ${d}_{z}$ is a hyperparameter. Vector $\mathbf{z}$ has size $T$, and each entry ${z}_{t}$ ranges from 0 to 1, quantifying the importance of the stateaction pair $({s}_{t},{a}_{t})$ in predicting the reward $r({s}_{{\alpha}_{t}},{a}_{{\alpha}_{t}})$ for interval ${\alpha}_{t}=\{0,1,\mathrm{\dots},t\}$. We combine the hidden representations in $H$ using $\mathbf{z}$ to obtain a summarized representation ${\mathbf{h}}_{t}^{*}={z}_{t}{\mathbf{h}}_{t}$. To predict the reward $\widehat{r}({s}_{{\alpha}_{t}},{a}_{{\alpha}_{t}})$, we add one regression layer parameterized by ${\mathbf{w}}_{r}$ and ${b}_{r}$ and output the predicted reward as
$$\widehat{r}({s}_{{\alpha}_{t}},{a}_{{\alpha}_{t}})={\mathbf{w}}_{r}^{\top}{\mathbf{h}}_{\mathbf{t}}^{*}+{b}_{r}.$$ 
Some other networks we consider include the feedforward neural network and long shortterm memory network (LSTM). Please see Figure A2 to see their differences.
3 Experiments
In this section, we conduct experiments to provide evidences for the following questions? (1) Is the learned reward function useful for improving policy optimization in episodic RL? (2) What are the appropriate choices of interval set $\mathcal{I}$, neural network model, and dataset $D$? Note that we also provide a case study on visualization and intepretability of the learned reward function in Appendix.
3.1 Experimental Settings
The experiments are conducted on a set of highdimensional locomotion tasks in continuous domain using OpenAI gym [6] and MuJoCo simulation toolkits. We use the PPO algorithm introduced in [32] for all the experiments. The policy is represented by a unimode Gaussian distribution with diagonal covariance. The mean is parameterized by a twolayer neural network with 64 hidden units and tanh nonlinearity. The log standard deviation is parameterized by a global vector. The same architecture is applied for value function approximation. In practice, we implemented the policy gradient estimation shown in Eq. (7), where we decompose into two parts: one with the predicted reward, the other with only the residual. In this way, we can use advanced variance reduction approaches to improve the estimation of each components. In our experiments, we use the generalized advantage estimation (GAE) [31] as the control variate method for variance reduction. More hyperparameters of PPO and GAE are tabulated in Table A1 in Appendix. We also compare against baseline algorithms trained with episodic reward, respectively. The episodic reward of each rollout trajectory is defined as the accumulated original reward at all timesteps. The experiments are run for 5M timesteps with 5 random seeds. For baseline algorithms, we also consider an LSTM policy with 128 hidden units besides the aforementioned MLP policy. All experiments are conducted on NVIDIA 1080 GPUs.
Hopper  Walker2d  Humanoid  HumanoidStandup  Swimmer  

PPO (episodic)  437  266  516  44673  6 
CEM  97  205  426  $\approx 9.6\times {10}^{4}$  17 
Ours  1462  3217  2209  82579  135 
For the reward function in our experiments, we investigate three neural network architectures, FeedForward (FF) Network, LSTM [13], and Transformer, to identify the best network structure that captures temporal dependency (Figure A2). For observation and action ${s}_{t},{a}_{t}$ at time $t$, an FF network, whose parameter $\varphi $ is shared across all time steps, is used to give the predicted reward ${\widehat{r}}_{t}$ for individual time steps. LSTM’s reward prediction is also conditioned on all previous timesteps, so is a function of trajectory segment ${s}_{0:t},{a}_{0:t}$. To perform reward regression based on the LSTM/FF outputs, we aggregate the the representations using a meanpooling layer. The hyperparameters of LSTM and FF are also listed in A1.
For the buffer updating schemes, we proposed the following three approaches to renew the trajectory buffer for reward predicting.

•
Online (O): The buffer is implemented as a FIFO queue with length $K$ (hyperparameter). In each iteration, the new rollout trajectories will be inserted into the queue and the old trajectories will be removed if the queue is too long.

•
Historical+Online (HO): Two queues are maintained: One queue is the same as the one in Online, storing the most recent rollouts by the current policy. The other queue stores the trajectories with the highest episodic returns from previous rollouts.

•
StratifiedSampling (S): To balance the training with episodic return regression, the buffer stores a larger number of trajectories in the history. In each iteration, $K$ trajectories are sampled to ensure the episodic return is uniformly in the five bins. The queue size $L$ and sample queue size $K$ are hyperparameters.
3.2 Credit Assignment Enables Policy Optimization with Episodic Reward
Our experiments with the baseline algorithms on the MuJoCo control suite demonstrated that learning with episodic return is extremely hard both with MLP policy and LSTM policy. In all the environments we tested (Figure 1), it is not surprising that the baseline method PPO (episodic) with episodic reward performs much worse than previously reported in other papers with dense reward [32]. In most of the runs, the policy cannot make any improvement after the initial time steps.
On the other hand, our proposed creditassignment algorithm with learned return decomposition consistently achieves better performance than the episodic return baselines across all environments (Figure 1). Here we use the Transformer network for decomposing rewards and MLP for representing policy, and the HO strategy to collect trajectories for optimization. In many experiments, the policies learned by our method are able to achieve the quite reasonably good performance when the original dense reward is used for training. In environments such as the Humanoid, the Hopper and the Swimmer, with the appropriate hyperparameters, our methods can obtain comparable performance to the policies trained with the original dense rewards, outperforming the episodic return baseline by a large margin. In addition, it also outperforms the crossentropy method (CEM) which is suitable for the episodic RL setting (Table 1). These results indicate that the our reward decomposition framework can successfully enable stable learning and greatly improve the sample efficiency in episodic settings.
3.3 Ablation Analysis
Here we perform comparisons to check the choices of 1) three network structures for reward function; 2) strategies for data collection. We also check the utility of bias correction when the reward regression is not well fitted. We use the environment Walker2d to perform these analyses and demonstrate the results in Figure 2.
As we can see, all the three networks structures outperform the episodic baselines, and the Transformer network provides the best performance. We conjecture that the reason why Transformer performs better than LSTM is that it is easier to train, as also implied by existing NLP literature [40]. Then we check whether the data collection strategy has any impact on the performance. All three strategies seem to work reasonably well, while HO appears to be the best which indicating that the historical highquality trajectories helps reward learning. Finally, we compare the performance of ${\nabla}_{\theta}J(\theta )$ and ${\nabla}_{\theta}\widehat{J}(\theta )$ under different learning rates (${10}^{2}$ and ${10}^{3}$) for the regression loss. We use different learning rates essentially to adjust the quality of the reward fitting. It is easy to see that with the bias correction, the learning of ${\nabla}_{\theta}J(\theta )$ is more robust, indicating that bias correction is needed. Extra results on ablation analysis can be found in the appendix (Figures A3, A4).
4 Related Work
The optimal reward design problem [35, 36] concerns about finding the proxy reward function can obtains high expected return according to the true reward function. Inverse reward design [10] studies the opposite problem of inferring true reward from designed reward. Intrinsic motivation has been shown to improve sample efficiency of RL algorithms, for example by using information gain [15], pseudo count [4] or prediction error [37, 28] as an intrinsic bonus reward to aid exploration. [42] studies intrinsic motivation under the optimal reward design framework where the optimal intrinsic reward is learned through gradient descent. Reward shaping [26] explores the space of reward function modifications (specifically potentialbased rewards) which do not change the corresponding optimal policy. Hindsight Experience Replay [2] adds additional goals and corresponding rewards to a Qlearning algorithm. Metalearning can also be utilized to learn different objective functions for different goals [14]. Our approach is more general as we don’t assume specific goals of the agent. When only expert demonstrations are available as in imitation learning, inverse reinforcement learning can be used to recover the reward function from expert trajectories [1]. As an alternative approach for solving sparse reward problems, the auxiliarytask approaches [17] focus on improving the representation by adding extra selfsupervised losses. However, our proposed method learns to decompose the episodic return as reward for policy optimization directly.
Recently, several concurrent works have explored the same direction of decomposing the episodic return. Sparse Attentive Backtracking[19] applies the attentive mechanism for network straining, while ours focuses on learning the decomposition of the reward signal. Temporal Value Transport (TVT) [16] relies on a memory reconstitutive module to retrieve past data for the visual RL problems, however, ours aims at decomposing the episodic return for general problems. As one most related work, RUDDER [3] could be viewed as a special case of our work in which consecutive timesteps were used. On the other hand, our work provides a general framework with interval rewards which justifies the applicability of language models and the correctness of generalized policy gradient beyond the simple Markovian assumption.
5 Conclusion
We presented a new algorithm for learning temporal credit assignment, which uses deep neural network (Transformer) to decompose the episodic reward back to each time step in the trajectory. The assigned dense reward signals obtained from the decomposition are then used to guide training algorithms of policies. We demonstrated that our credit assignment algorithm substantially improved the learning and sample efficiency on a set of MuJoCo continuous locomotive control tasks. The reward function learned by our algorithm can also be interpreted by an attention mechanism, potentially providing insights on identifying key stateaction pairs that contribute to successful reinforcement learning.
References
 [1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twentyfirst international conference on Machine learning, page 1. ACM, 2004.
 [2] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
 [3] Jose A ArjonaMedina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
 [4] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying countbased exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
 [5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
 [6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 [7] Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [8] Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning selfimitating diverse policies. Proceedings of the International Conference on Learning Representations (ICLR), 2019.
 [9] Yijie Guo, Junhyuk Oh, Satinder Singh, and Honglak Lee. Generative adversarial selfimitation learning. arXiv preprint arXiv:1812.00950, 2018.
 [10] Dylan HadfieldMenell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765–6774, 2017.
 [11] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized selfadaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
 [12] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 [13] Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 [14] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.
 [15] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
 [16] ChiaChun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. arXiv preprint arXiv:1810.06721, 2018.
 [17] Max Jaderberg, Volodymyr Mnih, Wojciech Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2017.
 [18] Daniel Jurafsky and James H. Martin. Speech and Language Processing (2Nd Edition). PrenticeHall, Inc., Upper Saddle River, NJ, USA, 2009.
 [19] Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C. Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, pages 7651–7662, USA, 2018. Curran Associates Inc.
 [20] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. Endtoend training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 [21] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
 [22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured selfattentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
 [23] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pages 50–56. ACM, 2016.
 [24] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 [25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [26] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
 [27] Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular denovo design through deep reinforcement learning. Journal of cheminformatics, 9(1):48, 2017.
 [28] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiositydriven exploration by selfsupervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
 [29] Reuven Rubinstein. The crossentropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1(2):127–190, 1999.
 [30] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
 [31] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 [32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [33] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
 [34] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 [35] Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from. In Proceedings of the annual conference of the cognitive science society, pages 2601–2606, 2009.
 [36] Jonathan Sorg, Richard L Lewis, and Satinder P Singh. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pages 2190–2198, 2010.
 [37] Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
 [38] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [39] Richard Stuart Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 1984. AAI8410337.
 [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 [41] Ronald J Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning, 8(34):229–256, 1992.
 [42] Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4649–4659, 2018.
Appendix A Method Overview and Network structures
Appendix B Proof of Theorem 1
I) Recall the standard likelihood ratio gradient formula:
${\nabla}_{\theta}\widehat{J}(\theta )$  $={\nabla}_{\theta}{\mathbb{E}}_{{\pi}_{\theta}}[\widehat{R}(\tau )]$  
$={\mathbb{E}}_{{\pi}_{\theta}}\left[\widehat{R}(\tau ){\displaystyle \sum _{t=0}^{T}}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right]$  
$={\mathbb{E}}_{{\pi}_{\theta}}\left[\left({\displaystyle \sum _{\alpha \in \mathcal{I}}}\widehat{r}({s}_{\alpha},{a}_{\alpha})\right)\left({\displaystyle \sum _{t=0}^{T}}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right)\right]$  
$={\displaystyle \sum _{\alpha \in \mathcal{I}}}{\mathbb{E}}_{{\pi}_{\theta}}\left[\widehat{r}({s}_{\alpha},{a}_{\alpha}){\displaystyle \sum _{t=0}^{T}}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right].$  (9) 
On the other hand, note that for any $t\notin {\mathrm{\Gamma}}_{\alpha}$, we have
${\mathbb{E}}_{{\pi}_{\theta}}[\widehat{r}({s}_{\alpha},{s}_{\alpha}){\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})]=0.$ 
Therefore, all the pairs $(\alpha ,t)$ with $t\notin {\mathrm{\Gamma}}_{\alpha}$ is removed in (9). This hence yields (5).
II) Eq. (6) is a simple rearrangement of Eq. (5).
${\nabla}_{\theta}\widehat{J}(\theta )$  $={\displaystyle \sum _{\alpha \in \mathcal{I}}}{\mathbb{E}}_{{\pi}_{\theta}}\left[\widehat{r}({s}_{\alpha},{a}_{\alpha}){\displaystyle \sum _{t\in {\mathrm{\Gamma}}_{\alpha}}}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right]$  
$={\mathbb{E}}_{{\pi}_{\theta}}\left[{\displaystyle \sum _{t=0}^{T}}\left({\displaystyle \sum _{\alpha :t\ni {\mathrm{\Gamma}}_{\alpha}}}\widehat{r}({s}_{\alpha},{a}_{\alpha})\right){\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t})\right],$ 
where $({\sum}_{\alpha :t\ni {\mathrm{\Gamma}}_{\alpha}}\widehat{r}({s}_{\alpha},{a}_{\alpha})={Q}_{t}(\tau )$, matching our definition. This completes the proof.
Appendix C Hyperparameters
Hyperparameter  Searching Values 

Policy Network  MLP with shape (64, 64) 
PPO Batch Size  2048 
PPO MiniBatch Size  64 
PPO number of epoch per iteration  5 
PPO learning rate  0.0001 
PPO clip range $\u03f5$  0.2 
GAE $\gamma $  0.99 
GAE $\lambda $  0.95 
Buffer size  50 
Reward predictor learning rate  $0.001$ 
Transformer number of heads  4 
Transformer layer size  64 
Transformer hidden layer size  128 
Transformer query/key size  32 
LSTM hidden size  96 
FF channels  $[128,128,128,256]$ 
Appendix D More Results on Ablation Analysis
Appendix E Interpretability of the Learned Reward Function
We visualize the learned agent and reward function to demonstrate that the reward predictor could attain knowledge from interacting with environment. Key state action pairs that contribute to successful reinforcement learning can also identified from the visualization . Using the Hopper environment as an example, we visualized in Figure A5 the learned temporal attentions for 1000 time steps, extracted from the last layer of the Transformer network. The first stage A corresponds to the agent starting a large jump, followed by the landing (stage B) and laying on the ground (stage C). The agent then makes a smaller hop in stage D and lands in stage E. We observed that the temporal attentions exhibit periodic behavior that coincides with the agent’s periodic motion. In the Hopper environment, the goal is to make the 2D onelegged robot move forward as fast as possible. A successfully trained agent essentially learns to hop forward periodically. Hence it is reasonable that the temporal attention and predicted reward show periodicity as the hopper jumps. Similar periodic behaviors of moving and adjusting balance were also be observed in other locomotion environments such as Humanoid.