Abstract
Many reinforcement learning (RL) tasks have specific properties that can beleveraged to modify existing RL algorithms to adapt to those tasks and furtherimprove performance, and a general class of such properties is the multiplereward channel. In those environments the full reward can be decomposed intosubrewards obtained from different channels. Existing work on rewarddecomposition either requires prior knowledge of the environment to decomposethe full reward, or decomposes reward without prior knowledge but with degradedperformance. In this paper, we propose Distributional Reward Decomposition forReinforcement Learning (DRDRL), a novel reward decomposition algorithm whichcaptures the multiple reward channel structure under distributional setting.Empirically, our method captures the multichannel structure and discoversmeaningful reward decomposition, without any requirements on prior knowledge.Consequently, our agent achieves better performance than existing methods onenvironments with multiple reward channels.
Quick Read (beta)
Distributional Reward Decomposition for Reinforcement Learning
Abstract
Many reinforcement learning (RL) tasks have specific properties that can be leveraged to modify existing RL algorithms to adapt to those tasks and further improve performance, and a general class of such properties is the multiple reward channel. In those environments the full reward can be decomposed into subrewards obtained from different channels. Existing work on reward decomposition either requires prior knowledge of the environment to decompose the full reward, or decomposes reward without prior knowledge but with degraded performance. In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm which captures the multiple reward channel structure under distributional setting. Empirically, our method captures the multichannel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels.
Distributional Reward Decomposition for Reinforcement Learning
Zichuan Lin^{†}^{†}thanks: Contributed during internship at Microsoft Research. Tsinghua University [email protected] Li Zhao Microsoft Research [email protected] Derek Yang UC San Diego [email protected] Tao Qin Microsoft Research [email protected] Guangwen Yang Tsinghua University [email protected] TieYan Liu Microsoft Research [email protected]
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
Reinforcement learning has achieved great success in decision making problems since Deep Qlearning was proposed by mnih2015human. While general RL algorithms have been deeply studied, here we focus on those RL tasks with specific properties that can be utilized to modify general RL algorithms to achieve better performance. Specifically, we focus on RL environments with multiple reward channels, but only the full reward is available.
Reward decomposition has been proposed to investigate such properties. For example, in Atari game Seaquest, rewards of environment can be decomposed into subrewards of shooting sharks and those of rescuing divers. Reward decomposition views the total reward as the sum of subrewards that are usually disentangled and can be obtained independently (sprague2003multiple; russell2003q; van2017hybrid; grimm2019learning), and aims at decomposing the total reward into subrewards. The subrewards may further be leveraged to learn better policies.
van2017hybrid propose to split a state into different substates, each with a subreward obtained by training a general value function, and learn multiple value functions with subrewards. The architecture is rather limited due to requiring prior knowledge of how to split into substates. grimm2019learning propose a more general method for reward decomposition via maximizing disentanglement between subrewards. In their work, an explicit reward decomposition is learned via maximizing the disentanglement of two subrewards estimated with actionvalue functions. However, their work requires that the environment can be reset to arbitrary state and can not apply to general RL settings where states can hardly be revisited. Furthermore, despite the meaningful reward decomposition they achieved, they fail to utilize the reward decomposition into learning better policies.
In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), an RL algorithm that captures the latent multiplechannel structure for reward, under the setting of distributional RL. Distributional RL differs from valuebased RL in that it estimates the distribution rather than the expectation of returns, and therefore captures richer information than valuebased RL. We propose an RL algorithm that estimates distributions of the subreturns, and combine the subreturns to get the distribution of the total returns. In order to avoid naive decomposition such as 01 or halfhalf, we further propose a disentanglement regularization term to encourage the subreturns to be diverged. To better separate reward channels, we also design our network to learn different state representations for different channels.
We test our algorithm on chosen Atari Games with multiple reward channels. Empirically, our method has following achievements:

•
Discovers meaningful reward decomposition.

•
Requires no external information.

•
Achieves better performance than existing RL methods.
2 Background
We consider a general reinforcement learning setting, in which the interaction of the agent and the environment can be viewed as a Markov Decision Process (MDP). Denote the state space by $\mathcal{X}$, action space by $A$, the state transition function by $P$, the actionstate dependent reward function by $R$ and $\gamma $ the discount factor, we write this MDP as $(\mathcal{X},A,R,P,\gamma )$.
Given a fixed policy $\pi $, reinforcement learning estimates the actionvalue function of $\pi $, defined by ${Q}^{\pi}(x,a)={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{r}_{t}({x}_{t},{a}_{t})$ where $({x}_{t},{a}_{t})$ is the stateaction pair at time $t$, ${x}_{0}=x,{a}_{0}=a$ and ${r}_{t}$ is the corresponding reward. The Bellman equation characterizes the actionvalue function by temporal equivalence, given by
$${Q}^{\pi}(x,a)=R(x,a)+\gamma \underset{{x}^{\prime},{a}^{\prime}}{\mathbb{E}}\left[{Q}^{\pi}({x}^{\prime},{a}^{\prime})\right]$$ 
where ${x}^{\prime}\sim P(\cdot x,a),{a}^{\prime}\sim \pi (\cdot {x}^{\prime})$. To maximize the total return given by $\underset{{x}_{0},{a}_{0}}{\mathbb{E}}[{Q}^{\pi}({x}_{0},{a}_{0})]$, one common approach is to find the fixed point for the Bellman optimality operator
$${Q}^{*}(x,a)=\mathcal{T}{Q}^{*}(x,a)=R(x,a)+\gamma \underset{{x}^{\prime}}{\mathbb{E}}\left[\underset{{a}^{\prime}}{\mathrm{max}}{Q}^{*}({x}^{\prime},{a}^{\prime})\right]$$ 
with the temporal difference (TD) error, given by
$${\delta}_{t}^{2}={\left[{r}_{t}+\gamma \underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{max}}Q({x}_{t+1},{a}^{\prime})Q({x}_{t},{a}_{t})\right]}^{2}$$ 
over the samples $({x}_{t},{a}_{t},{s}_{t},{x}_{t+1})$ along the trajectory. mnih2015human propose Deep QNetworks (DQN) that learns the actionvalue function with a neural network and achieves humanlevel performance on the Atari57 benchmark.
2.1 Reward Decomposition
Studies for reward decomposition also leads to state decomposition (laversanne2018curiosity; thomas2017independently), where state decomposition is leveraged to learn different policies. Extending their work, grimm2019learning explore the decomposition of the reward function directly, which is considered to be most related to our work. Denote the $i$th ($i$=1,2,…,$N$) subreward function at stateaction pair $(x,a)$ as ${r}_{i}(x,a)$, the complete reward function is given by
$$r=\sum _{i=1}^{N}{r}_{i}$$ 
For each subreward function, consider the subvalue function ${U}_{i}^{\pi}$ and corresponding policy ${\pi}_{i}$:
${U}_{i}^{\pi}({x}_{0},{a}_{0})={\mathbb{E}}_{{x}_{t},{a}_{t}}[{\displaystyle \sum _{t=0}^{\mathrm{\infty}}}{\gamma}^{t}{r}_{i}({x}_{t},{a}_{t})]$  
${\pi}_{i}=\underset{\pi}{\mathrm{arg}\mathrm{max}}{U}_{i}^{\pi}$ 
where ${x}_{t}\sim P(\cdot \pi ,{x}_{0},{a}_{0}),{a}_{t}\sim \pi (\cdot {x}_{t})$.
In their work, reward decomposition is considered meaningful if each reward is obtained independently (i.e. ${\pi}_{i}$ should not obtain ${r}_{j}$) and each reward is obtainable.
Two evaluate the two desiderata, the work proposes the following values:
$${J}_{independent}({r}_{1},\mathrm{\dots},{r}_{n})={\mathbb{E}}_{s\sim \mu}\left[\sum _{i\ne j}{\alpha}_{i,j}(s){U}_{i}^{{\pi}_{j}^{*}}(s)\right],$$  (1) 
$${J}_{nontrivial}({r}_{1},\mathrm{\dots},{r}_{n})={\mathbb{E}}_{s\sim \mu}\left[\sum _{i=1}^{n}{\alpha}_{i,i}(s){U}_{i}^{{\pi}_{i}^{*}}(s)\right],$$  (2) 
where ${\alpha}_{i,j}$ is for weight control and for simplicity set to 1 in their work. During training, the network would maximize ${J}_{nontrivial}{J}_{independent}$ to achieve the desired reward decomposition.
2.2 Distributional Reinforcement Learning
In most reinforcement learning settings, the environment is not deterministic. Moreover, in general people train RL models with an $\u03f5$greedy policy to allow exploration, making the agent also stochastic. To better analyze the randomness under this setting, bellemare2017distributional propose C51 algorithm and conduct theoretical analysis of distributional RL.
In distributional RL, reward ${R}_{t}$ is viewed as a random variable, and the total return is defined by $Z={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{R}_{t}$. The expectation of $Z$ is the traditional actionvalue $Q$ and the distributional Bellman optimality operator is given by
$$\mathcal{T}Z(x,a):\stackrel{D}{=}R(x,a)+\gamma Z({x}^{\prime},\underset{{a}^{\prime}\in \mathcal{A}}{\mathrm{arg}\mathrm{max}}\mathbb{E}Z({x}^{\prime},{a}^{\prime}))$$ 
where if random variable $A$ and $B$ satisfies $A\stackrel{D}{=}B$ then $A$ and $B$ follow the same distribution.
Random variable is characterized by a categorical distribution over a fixed set of values in C51, and it outperforms all previous variants of DQN on Atari domain.
3 Distributional Reward Decomposition for Reinforcement Learning
3.1 Distributional Reward Decomposition
In many reinforcement learning environments, there are multiple sources for an agent to receive reward as shown in Figure 1. Our method is mainly designed for environments with such property.
Under distributional setting, we will assume reward and subrewards are random variables and denote them by $R$ and ${R}_{i}$ respectively. In our architecture, the categorical distribution of each subreturn ${Z}_{i}={\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{R}_{i,t}$ is the output of a network, denoted by ${\mathcal{F}}_{i}(x,a)$. Note that in most cases, subreturns are not independent, i.e. $P({Z}_{i}=v)!=P({Z}_{i}=v{Z}_{j})$. So theoretically we need ${\mathcal{F}}_{ij}(x,a)$ for each $i$ and $j$ to obtain the distribution of the full return. We call this architecture as nonfactorial model or fulldistribution model. The nonfactorial model architecture is shown in appendix. However, experiment shows that using an approximation form of $P({Z}_{i}=v)\approx P({Z}_{i}=v{Z}_{j})$ so that only ${\mathcal{F}}_{i}(x,a)$ is required performs much better than brutally computing ${\mathcal{F}}_{ij}(x,a)$ for all $i,j$, we believe that is due to the increased sample number. In this paper, we approximate the conditional probability $P({Z}_{i}=v{Z}_{j})$ with $P({Z}_{i}=v)$.
Consider categorical distribution function ${\mathcal{F}}_{i}$ and ${\mathcal{F}}_{j}$ with same number of atoms $K$, the $k$th atom is denoted by ${a}_{k}$ with value ${a}_{k}={a}_{0}+kl,1\le k\le K$ where $l$ is a constant. Let random variable ${Z}_{i}\sim {\mathcal{F}}_{i}$ and ${Z}_{j}\sim {\mathcal{F}}_{j}$, from basic probability theory we know that the distribution function of $Z={Z}_{i}+{Z}_{j}$ is the convolution of ${F}_{i}$ and ${F}_{j}$
$$\begin{array}{cc}\hfill \mathcal{F}(v)=P({Z}_{i}+{Z}_{j}=v)& =\sum _{k=1}^{K}P({Z}_{i}={a}_{k})P({Z}_{j}=v{a}_{k}{Z}_{i}={a}_{k})\hfill \\ & \approx \sum _{k=1}^{K}P({Z}_{i}={a}_{k})P({Z}_{j}=v{a}_{k})={\mathcal{F}}_{i}(v)*{\mathcal{F}}_{j}(v).\hfill \end{array}$$  (3) 
When we use $N$ subreturns, the distribution function of the total return is then given by $\mathcal{F}={\mathcal{F}}_{1}*{\mathcal{F}}_{2}*\mathrm{\cdots}*{\mathcal{F}}_{N}$ where $*$ denotes linear 1Dconvolution.
While reward decomposition is not explicitly done in our algorithm, we can derive the decomposed reward with using trained agents. Recall that total return $Z={\sum}_{i=1}^{N}{Z}_{i}$ follows bellman equation, so naturally we have
$$\mathcal{T}Z\stackrel{D}{=}\mathcal{T}(\sum _{i=1}^{N}{Z}_{i})\stackrel{D}{=}R+\gamma {Z}^{\prime}=(\sum _{i=1}^{N}{R}_{i})+\gamma (\sum _{i=1}^{N}{Z}_{i}^{\prime})$$  (4) 
where ${Z}_{i}^{\prime}$ represents subreturn on the next stateaction pair. Note that we only have access to a sample of the full reward $R$, the subrewards ${R}_{i}$ are arbitrary and for better visualization a direct way of deriving them is given by
$${R}_{i}={Z}_{i}\gamma {Z}_{i}^{\prime}$$  (5) 
In the next section we will present an example of those subrewards by taking their expectation $\mathbb{E}({R}_{i})$. Note that our reward decomposition is latent and we do not need ${R}_{i}$ for our algorithm, Eq. 5 only provides an approach to visualize our reward decomposition.
3.2 Disentangled Subreturns
To obtain meaningful reward decomposition, we want the subrewards to be disentangled. Inspired by grimm2019learning, we compute the disentanglement of distributions of two subreturns ${F}^{i}$ and ${F}^{j}$ on state $x$ with the following value:
$${J}_{disentang}^{ij}={D}_{KL}({\mathcal{F}}_{x,{\mathrm{arg}\mathrm{max}}_{a}\mathbb{E}({Z}_{i})}{\mathcal{F}}_{x,{\mathrm{arg}\mathrm{max}}_{a}\mathbb{E}({Z}_{j})}),$$  (6) 
where ${D}_{KL}$ denotes the crossentropy term of KL divergence.
Intuitively, ${J}_{disentang}^{ij}$ estimates the disentanglement of subreturns ${Z}_{i}$ and ${Z}_{j}$ by first obtaining actions that maximize $\mathbb{E}({Z}_{i})$ and $\mathbb{E}({Z}_{j})$ respectively, and then computes the KL divergence between the two estimated total returns of the actions. If ${Z}_{i}$ and ${Z}_{j}$ are independent, the action maximizing two subreturns would be different and such difference would reflect in the estimation for total return. Through maximizing this value, we can expect a meaningful reward decomposition that learns independent rewards.
3.3 Projected Bellman Update with Regularization
Following the work of C51 (bellemare2017distributional), we use projected Bellman update for our algorithm. When applied with the Bellman optimality operator, the atoms of $\mathcal{T}Z$ is shifted by ${r}_{t}$ and shrank by $\gamma $. However to compute the loss, usually KL divergence between $Z$ and $\mathcal{T}Z$, it is required that the two categorical distributions are defined on the same set of atoms, so the target distribution $\mathcal{T}Z$ would need to be projected to the original set of atoms before Bellman update. Consider a sample transition $(x,a,r,{x}^{\prime})$, the projection operator $\mathrm{\Phi}$ proposed in C51 is given by
$${\left(\mathrm{\Phi}\mathcal{T}Z(x,a)\right)}_{i}=\sum _{j=0}^{M1}{\left[1\frac{\left{\left[r+\gamma {a}_{j}\right]}_{{V}_{min}}^{{V}_{max}}{a}_{i}\right}{l}\right]}_{0}^{1}{\mathcal{F}}_{{x}^{\prime},{a}^{\prime}}\left({a}_{j}\right),$$  (7) 
where $M$ is the number of atoms in C51 and ${[\cdot ]}_{a}^{b}$ bounds its argument in $[a,b]$. The sample loss for $(x,a,r,{x}^{\prime})$ would be given by the crossentropy term of KL divergence of $Z$ and $\mathrm{\Phi}\mathcal{T}Z$
$${\mathcal{L}}_{x,a,r,{x}^{\prime}}={D}_{KL}(\mathrm{\Phi}\mathcal{T}Z(x,a)Z(x,a)).$$  (8) 
Let ${\mathcal{F}}^{\theta}$ be a neural network parameterized by $\theta $, we combine distributional TD error and disentanglement to jointly update $\theta $. For each sample transition $(x,a,r,{x}^{\prime})$, $\theta $ is updated by minimizing the following objective function:
$${\mathcal{L}}_{x,a,r,{x}^{\prime}}\lambda \sum _{i}\sum _{j!=i}{J}_{disentang}^{ij},$$  (9) 
where $\lambda $ denotes learning rate.
3.4 Multichannel State Representation
One complication of our approach outlined above is that very often the distribution ${\mathcal{F}}_{i}$ cannot distinguish itself from other distributions (e.g., ${\mathcal{F}}_{j},j\ne i$) during learning since they all depend on the same state feature input. This brings difficulties in maximizing disentanglement by jointly training as different distribution functions are exchangeable. A naive idea is to split the state feature $\psi (x)$ into $N$ pieces (e.g., $\psi {(x)}_{1}$, $\psi {(x)}_{2}$, …, $\psi {(x)}_{N}$) so that each distribution depends on different substatefeatures. However, we empirically found that this method is not enough to help learn good disentangled subreturns.
To address this problem, we utilize an idea similar to universal value function approximation (UVFA) (schaul2015universal). The key idea is to take onehot embedding as an additional input to condition the categorical distribution function, and apply the elementwise multiplication $\psi \odot \varphi $, to force interaction between state features and the onehot embedding feature:
$${\mathcal{F}}_{i}(x,a)={\mathcal{F}}_{{\theta}_{i}}{(\psi (x)\odot \varphi ({e}_{i}))}_{a},$$  (10) 
where ${e}_{i}$ denotes the onehot embedding where the $i$th element is one and $\varphi $ denotes onelayer nonlinear neural network that is updated by backpropagation during training.
In this way, the agent explicitly learns different distribution functions for different channels. The complete network architecture is shown in Figure 1.
4 Experiment Results
We tested our algorithm on the games from Arcade Learning Environment (ALE; bellemare2013arcade). We conduct experiments on six Atari games, some with complicated rules and some with simple rules. For our study, we implemented our algorithm based on Rainbow (hessel2018rainbow) which is an advanced variant of C51 (bellemare2017distributional) and achieved stateoftheart results in Atari games domain. We replace the update rule of Rainbow by Eq. 9 and network architecture of Rainbow by our convoluted architecture as shown in Figure 1. In Rainbow, the Qvalue is bounded by $[{V}_{min},{V}_{max}]$ where ${V}_{max}={V}_{min}=10$. In our method, we bound the categorical distribution of each subreturn ${Z}_{i}(i=1,2,\mathrm{\dots},N)$ by a range of $[\frac{{V}_{min}}{N},\frac{{V}_{max}}{N}]$. Rainbow uses a categorical distribution with $M=51$ atoms. For fair comparison, we assign $K=\lfloor \frac{M}{N}\rfloor $ atoms for the distribution of each subreturn, which results in the same network capacity as the original network architecture.
Our code is built upon dopamine framework (castro18dopamine). We use the default welltuned hyperparameter setting in dopamine. For our updating rule in Eq. 9, we set $\lambda =0.0001$. We run our agents for 100 epochs, each with 0.25 million of training steps and 0.125 million of evaluation steps. For evaluation, we follow common practice in van2016deep, starting the game with up to 30 noop actions to provide random starting positions for the agent. All experiments are performed on NVIDIA Tesla V100 16GB graphics cards.
4.1 Comparison with Rainbow
To verify that our architecture achieves reward decomposition without degraded performance, we compare our methods with Rainbow. However we are not able to compare our method with van2017hybrid and grimm2019learning since they require either predefined state preprocessing or specificstate resettable environments. We test our reward decomposition (RD) with 2 and 3 channels (e.g., RD(2), RD(3)). The results are shown in Figure 2. We found that our methods perform significantly better than Rainbow on the environments that we tested. This implies that our distributional reward decomposition method can help accelerate the learning process. We also discover that on some environments, RD(3) performs better than RD(2) while in the rest the two have similar performance. We conjecture that this is due to the intrinsic settings of the environments. For example, in Seaquest and UpNDown, the rules are relatively complicated, so RD(3) characterizes such complex reward better. However in simple environments like Gopher and Asterix, RD(2) and RD(3) obtain similar performance, and sometimes RD(2) even outperforms RD(3).
4.2 Reward Decomposition Analysis
Here we use Seaquest to illustrate our reward decomposition. Figure 3 shows the subrewards obtained by taking the expectation of the LHS of Eq.5 and the original reward along an actual trajectory. We observe that while ${r}_{1}=\mathbb{E}({R}_{1})$ and ${r}_{2}=\mathbb{E}({R}_{2})$ basically add up to the original reward $r$, ${r}_{1}$ dominates as the submarine is close to the surface, i.e. when it rescues the divers and refills oxygen. When the submarine scores by shooting sharks, ${r}_{2}$ becomes the main source of reward. We also monitor the distribution of different subreturns when the agent is playing game. In Figure 4 (a), the submarine floats to the surface to rescue the divers and refill oxygen and ${Z}_{1}$ has higher values. While in Figure 4 (b), as the submarine dives into the sea and shoots sharks, expected values of ${Z}_{2}$ (orange) are higher than ${Z}_{1}$ (blue). This result implies that the reward decomposition indeed captures different sources of returns, in this case shooting sharks and rescuing divers/refilling oxygen. We also provide statistics on actions for quantitative analysis to support the argument. In Figure 6(a), we count the occurrence of each action obtained with ${\mathrm{arg}\mathrm{max}}_{a}\mathbb{E}({Z}_{1})$ and ${\mathrm{arg}\mathrm{max}}_{a}\mathbb{E}({Z}_{2})$ in a single trajectory, using the same policy as in Figure 4. We see that while ${Z}_{1}$ prefers going up, ${Z}_{2}$ prefers going down with fire.
4.3 Visualization by saliency maps
To better understand the roles of different subrewards, we train a DRDRL agent with two channels (N=2) and compute saliency maps (simonyan2013deep). Specifically, to visualize the salient part of the images as seen by different subpolicies, we compute the absolute value of the Jacobian ${\nabla}_{x}{Q}_{i}(x,{\mathrm{arg}\mathrm{max}}_{{a}^{\prime}}Q(x,{a}^{\prime}))$ for each channel. Figure 5 shows that visualization results. We find that channel 1 (red region) focuses on refilling oxygen while channel 2 (green region) pays more attention to shooting sharks as well as the positions where sharks are more likely to appear.
4.4 Direct Control using Induced Subpolicies
We also provide videos^{1}^{1} 1 https://sites.google.com/view/drdpaper of running subpolicies defined by ${\pi}_{i}={\mathrm{arg}\mathrm{max}}_{a}\mathbb{E}({Z}_{i})$. To clarify, the subpolicies are never rolled out during training or evaluation and are only used to compute ${J}_{disentang}^{ij}$ in Eq. 6. We execute these subpolicies and observe its difference with the main policy $\pi ={\mathrm{arg}\mathrm{max}}_{a}\mathbb{E}({\sum}_{i=1}^{M}{Z}_{i})$ to get a better visual effect of the reward decomposition. Take Seaquest in Figure 6(b) as an example, two subpolicies show distinctive preference. As ${Z}_{1}$ mainly captures the reward for surviving and rescuing divers, ${\pi}_{1}$ tends to stay close to the surface. However ${Z}_{2}$ represents the return gained from shooting sharks, so ${\pi}_{2}$ appears much more aggressive than ${\pi}_{1}$. Also, without ${\pi}_{1}$ we see that ${\pi}_{2}$ dies quickly due to out of oxygen.
5 Related Work
Our method is closely related to previous work of reward decomposition. Reward function decomposition has been studied among others by russell2003q and sprague2003multiple. While these earlier works mainly focus on how to achieve optimal policy given the decomposed reward functions, there have been several recent works attempting to learn latent decomposed rewards. van2017hybrid construct an easytolearn value function by decomposing the reward function of the environment into $n$ different reward functions. To ensure the learned decomposition is nontrivial, van2017hybrid proposed to split a state into different pieces following domain knowledge and then feed different state pieces into each reward function branch. While such method can accelerate learning process, it always requires many predefined preprocessing techniques. There has been other work that explores learn reward decomposition network endtoend. grimm2019learning investigates how to learn independentlyobtainable reward functions. While it learns interesting reward decomposition, their method requires that the environments be resettable to specific states since it needs multiple trajectories from the same starting state to compute their objective function. Besides, their method aims at learning different optimal policies for each decomposed reward function. Different from the works above, our method can learn meaningful implicit reward decomposition without any requirements on prior knowledge. Also, our method can leverage the decomposed subrewards to find better behaviour for a single agent.
Our work also relates to Horde (sutton2011horde). The Horde architecture consists of a large number of ‘subagents’ that learn in parallel via offpolicy learning. Each demon trains a separate general value function (GVF) based on its own policy and pseudoreward function. A pseudoreward can be any featurebased signal that encodes useful information. The Horde architecture is focused on building up general knowledge about the world, encoded via a large number of GVFs. UVFA (schaul2015universal) extends Horde along a different direction that enables value function generalizing across different goals. Our method focuses on learning implicit reward decomposition in order to more efficiently learn a control policy.
6 Conclusion
In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm which captures the multiple reward channel structure under distributional setting. Our algorithm significantly outperforms stateoftheart RL methods RAINBOW on Atari games with multiple reward channels. We also provide interesting experimental analysis to get insight for our algorithm. In the future, we might try to develop reward decomposition method based on quantile networks (dabney2018implicit; dabney2018distributional).
Acknowledgments
This work was supported in part by the National Key Research & Development Plan of China (grant No. 2016YFA0602200 and 2017YFA0604500), and by Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao).
References
Appendix
Nonfactorial model
To show that $P({Z}_{i}=v)=P({Z}_{i}=v{Z}_{j})$ is a reasonable assumption, we also implemented a nonfactorial version of DRDRL, i.e. assuming that $P({Z}_{i}=v)!=P({Z}_{i}=v{Z}_{j})$, with 3 channels. The architecture of the nonfactorial model is shown in figure 7. The three subdistributions have $K$ atoms, ${K}^{2}$ atoms and ${K}^{3}$ atoms respectively and they multiply to form a full distribution with ${K}^{3}$ atoms. To maintain similar network capacity as C51, we set $K=4$ to form a full distribution of 64 atoms.
Ablative results of the nonfactorial model are shown in the following section together with other tricks.
Ablative Analysis
Figure 8 shows how each component of our method contributes to DRDRL. Specifically, we test the performance of ‘1D convolution’, ‘1D convolution + KL’, ‘1Dconvolution + onehot’ and ‘1Dconvolution + KL + onehot’ as well as the nonfactorial model.
One may argue that the superior performance of DRDRL might come from the fact that fitting N simple subdistributions with M/N categories is easier than fitting one with M categories. We include another set of experiments to justify this argument by using M atoms instead of M/N atoms for each subdistribution.
Results show that every part of DRDRL is important. To our surprise, the KL term not only allows us to perform reward decomposition, we also observe significant differences between curves with and without the KL term. This suggests that learning decomposed reward can greatly boost the performance of RL algorithms. Together with the degraded performance of M atoms instead of M/N atoms in each subdistribution, it is suffice to suggest that the DRDRL’s success does not come from fitting easier distributions with M/N atoms.