Compositional Transfer in
Hierarchical Reinforcement Learning
Abstract
The successful application of general reinforcement learning algorithms to real-world robotics applications is often limited by their high data requirements. We introduce Regularized Hierarchical Policy Optimization (RHPO) to improve data-efficiency for domains with multiple dominant tasks and ultimately reduce required platform time. To this end, we employ compositional inductive biases on multiple levels and corresponding mechanisms for sharing off-policy transition data across low-level controllers and tasks as well as scheduling of tasks. The presented algorithm enables stable and fast learning for complex, real-world domains in the parallel multitask and sequential transfer case. We show that the investigated types of hierarchy enable positive transfer while partially mitigating negative interference and evaluate the benefits of additional incentives for efficient, compositional task solutions in single task domains. Finally, we demonstrate substantial data-efficiency and final performance gains over competitive baselines in a week-long, physical robot stacking experiment.
I Introduction
Creating real-world systems that learn to achieve many goals directly through interaction with their environment is one of the long-standing dreams in robotics. Although recent successes in deep (reinforcement) learning for computer games (Atari [28], StarCraft [55]), Go [44] and other simulated environments (e.g. [34]) have demonstrated the potential of these methods when large amounts of training data are available, the high cost of data acquisition has limited progress for many problems involving systems directly acting in the physical world.
Data efficiency in machine learning generally relies on inductive biases or prior knowledge to guide and accelerate the learning process. One strategy for injecting prior knowledge that is widely and successfully used in robotics learning problems is the use of human expert demonstrations to bootstrap the learning process. But the perspective of a system with a permanent embodiment capable of achieving many goals in a persistent environment provides us with a complementary opportunity: an efficient learning strategy should allow us to share and reuse experience across tasks – such that the system does not have to experience or learn the same thing multiple times, and such that solutions to simpler tasks can bootstrap the learning of harder ones.
Rather than providing prior knowledge or biases specific to a particular task, this suggests focusing on more general inductive biases that facilitate the sharing and reuse of experience and knowledge across tasks while allowing other aspects of the domain to be learned [9]. Previous approaches to transfer learning have, for example, built on optimizing initial parameters [e.g. 13], sharing models and parameters across tasks either in the form of policies or value functions [e.g. 41, 51, 15], data-sharing across tasks [e.g. 38, 5], or through the use of task-related auxiliary objectives [23, 57]. Transfer between tasks can, however, be either constructive or destructive, for humans [45] as well as for machines [35, 53]. That is, jointly learning to solve different tasks can provide both benefits and disadvantages for individual tasks, depending on their similarity. Finding a mechanism that enables transfer where possible but avoids interference is one of the long-standing research challenges.
In this paper, we propose a general reinforcement learning architecture that benefits from learning multiple tasks simultaneously and is sufficiently data-efficient and reliable to solve non-trivial manipulation tasks from scratch directly on robotics hardware. We achieve efficiency through three forms of transfer: (1) robust off-policy learning allows all generated transition data to be shared effectively across tasks and skills; (2) a modular hierarchical policy architecture allows skills to be directly reused across tasks; and (3) switching between the execution of policies for different tasks within a single episode leads to effective exploration.
The model uses deep neural networks to parameterize state-conditional Gaussian mixture distributions as agent policies, similar to Mixture Density Networks [7]. To obtain robust and versatile low-level behaviors in the multitask setting, we shield the mixture components from information about the task at hand. Task information is thus only communicated through the choice of mixture component by the high-level controller, and the mixture components are trained as domain-dependent but task-independent skills. To efficiently optimize hierarchical policies in a multitask setting, we develop robust off-policy learning schemes enabling us to use all transition data to train each low-level controller independent of the actually executed one. We focus on Maximum A-Posteriori Policy Optimization (MPO) [3] but also consider a variant of Stochastic Value Gradients (SVG) [20]. For both algorithms we employ trust-region-like constraints at both levels of the hierarchy.
We evaluate the approach on several real and simulated robotics manipulation tasks and demonstrate that it outperforms competitive baselines. In particular, it dramatically improves data efficiency on a challenging real-world robotics manipulation task similar to the one considered in [38]: our model learns to stack blocks from scratch on a single Sawyer robot arm within about a week, at which point it demonstrates up to three times higher performance compared to our baselines. We further perform a number of careful ablations. These highlight, among others, the importance of the hierarchical architecture and of the trust-region-like constraints for the stability of the learning scheme. Finally, to gain a better understanding of the role of this type of hierarchy in RL, we compare its benefits in the single-task and multitask settings. We find that it shows clear advantages in the multitask setting. However, it can fail to improve performance in the single-task case, where additional incentives are required to encourage component specialization similar to the multitask case. These results shed further light on the interaction of model and domain in RL.
In summary, our contributions are as follows:

• Algorithmic improvements: We propose a new method for robust and efficient off-policy optimization of hierarchical policies. Our approach controls the rate of change at both levels of the hierarchy via trust-region-like constraints, thus ensuring stable learning. Furthermore, it can use all data to train any given low-level component, independent of the component which generated the transition. This enables data-efficient training with experience replay and data sharing across tasks.

• Performance improvements: We evaluate our approach on a range of real and simulated robotic manipulation domains. The results confirm that the algorithm scales to complex tasks and significantly reduces interaction time. Particular benefits arise in more complex task sets and the low-data regime. When learning to stack from scratch on the Sawyer robot arm in a week-long experiment, the approach demonstrates up to three times better performance for the most complex tasks.

• Investigation of benefits, shortcomings and requirements: We perform a careful analysis and ablation of our algorithm and its properties, highlighting in particular the impact of individual algorithmic and environment properties, as well as the overall robustness to hyperparameter settings.
II Preliminaries
We consider a multitask reinforcement learning setting with an agent operating in a Markov Decision Process (MDP) consisting of the state space $\mathcal{S}$, the action space $\mathcal{A}$, and the transition probability $p(s_{t+1}|s_{t},a_{t})$ of reaching state $s_{t+1}$ from state $s_{t}$ when executing action $a_{t}$. The actions are drawn from a probability distribution over actions $\pi(a|s)$ referred to as the agent's policy. Jointly, the transition dynamics and policy induce the marginal state visitation distribution $p(s)$. The discount factor $\gamma$ together with the reward $r(s,a)$ gives rise to the expected reward, or value, of starting in state $s$ (and following $\pi$ thereafter) $V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s,\ a_{t}\sim\pi(\cdot|s_{t}),\ s_{t+1}\sim p(\cdot|s_{t},a_{t})\right]$. We define multitask learning over a set of tasks $i\in I$ with common agent embodiment as follows: We assume shared state and action spaces and shared transition dynamics; tasks only differ in their reward function $r_{i}(s,a)$. We consider task-conditional policies $\pi(a|s,i)$ with the overall objective defined as
$$\begin{aligned} J(\pi) &= \mathbb{E}_{i\sim I}\left[\mathbb{E}_{\pi,p(s_{0})}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\,\middle|\,s_{t+1}\sim p(\cdot|s_{t},a_{t})\right]\right] \\ &= \mathbb{E}_{i\sim I}\left[\mathbb{E}_{\pi,p(s)}\left[Q^{\pi}(s,a,i)\right]\right], \end{aligned}$$
where all actions are drawn according to the policy $\pi$ conditioned on task $i$, that is, $a_{t}\sim\pi(\cdot|s_{t},i)$, and we used the following definition of the task-conditional state-action value function (Equation 1).
$$Q^{\pi}(s,a,i)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{t},a_{t})\,\middle|\,a_{0}=a,\ s_{0}=s,\ a_{t}\sim\pi(\cdot|s_{t},i),\ s_{t+1}\sim p(\cdot|s_{t},a_{t})\right]$$  (1)
III Method
This section introduces Regularized Hierarchical Policy Optimization (RHPO), which focuses on efficient training of modular policies by sharing data across tasks. We first describe the underlying class of mixture policies, followed by details on the critic-weighted maximum likelihood optimization objective used to update structured hierarchical policies in a multitask, off-policy setting. For efficiency in the multitask case, RHPO extends data-sharing and scheduling mechanisms from Scheduled Auxiliary Control with randomized scheduling (SAC-U) [38].
III-A Hierarchical Policies
We start by defining the hierarchical policy class which supports sharing sub-policies across tasks. Formally, we decompose the per-task policy $\pi(a|s,i)$ as
$$\pi_{\theta}(a|s,i)=\sum_{o=1}^{M}\pi_{\theta}^{L}(a|s,o)\,\pi_{\theta}^{H}(o|s,i),$$  (2)
where $\pi^{H}$ and $\pi^{L}$ respectively represent a high-level switching controller (a categorical distribution) and a low-level sub-policy (components of the resulting mixture distribution), and $o$ is the index of the sub-policy. $\theta$ denotes the parameters of both $\pi^{H}$ and $\pi^{L}$, which we seek to optimize. While the number of components has to be decided externally, RHPO is robust with respect to this parameter (Appendix A-G3). Note that in the above formulation only the high-level controller $\pi^{H}$ is conditioned on the task information $i$. This choice introduces a form of information asymmetry [15, 52, 21] that enables the low-level policies to acquire general, task-independent behaviours. This choice strengthens the decomposition of tasks across domains and prevents degenerate solutions that bypass the high-level controller. Intuitively, these sub-policies can be understood as building reflex-like low-level control loops, which perform domain-dependent but task-independent behaviours and can be modulated by higher cognitive functions with knowledge of the task at hand. Figure 2 illustrates the hierarchical policy architecture used.
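To make the decomposition in Equation (2) concrete, the following is a minimal sketch and not the authors' implementation. It assumes diagonal Gaussian components and replaces the neural networks by hand-picked numbers for a single state; the names `mixture_log_prob`, `sample_action`, `means`, `stds` and `weights_by_task` are hypothetical. It illustrates the information asymmetry: the component parameters are shared by all tasks, and only the mixture weights depend on the task index.

```python
import math
import random

def gaussian_log_pdf(a, mean, std):
    """Log density of a diagonal Gaussian evaluated at action vector `a`."""
    return sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for x, m, s in zip(a, mean, std)
    )

def mixture_log_prob(a, weights, means, stds):
    """log pi(a|s,i) = log sum_o pi_H(o|s,i) * pi_L(a|s,o), via log-sum-exp."""
    terms = [
        math.log(w) + gaussian_log_pdf(a, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def sample_action(weights, means, stds, rng):
    """Sample o ~ pi_H(.|s,i), then a ~ pi_L(.|s,o)."""
    o = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return o, [rng.gauss(m, s) for m, s in zip(means[o], stds[o])]

# Assumed network outputs for one state s (illustrative numbers only).
# Low-level components see only s; high-level weights see (s, i).
means = [[-0.5, 0.0], [0.5, 0.2]]
stds = [[0.1, 0.1], [0.2, 0.2]]
weights_by_task = {0: [0.9, 0.1], 1: [0.2, 0.8]}

rng = random.Random(0)
component, action = sample_action(weights_by_task[1], means, stds, rng)
log_p = mixture_log_prob(action, weights_by_task[1], means, stds)
```

Evaluating `mixture_log_prob` never requires knowing which component produced the action, which is what later allows every transition to contribute to every component.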
III-B Data-efficient Multitask Policy Optimization
In the following sections, we present the core principles underlying RHPO; for the complete pseudocode algorithm please see Algorithm IIIB and Appendix A-B1. We build on an Expectation-Maximization based policy optimization algorithm (similar to MPO [2]) and adapt it to hierarchical policies in the multitask case. We update the parametric policy in two stages and decouple the policy improvement step from the fitting of the parametric policy.
We begin by describing the policy improvement steps below, assuming that we have an approximation of the ground-truth state-action value function $\widehat{Q}(s,a,i)\approx Q^{\pi}(s,a,i)$ available (see Equation (7) for details on learning $\widehat{Q}$ from off-policy data). Starting from an initial policy $\pi_{\theta_{0}}$ we can then iterate the following steps to improve the policy $\pi_{\theta_{k}}$:

• Policy Evaluation: Update $\widehat{Q}$ such that $\widehat{Q}(s,a,i)\approx\widehat{Q}^{\pi_{\theta_{k}}}(s,a,i)$, see Equation (7).

• Policy Improvement:

– Step 1: Obtain $q_{k}=\arg\max_{q}J(q)$, under $\mathrm{KL}$ constraints with $\pi_{ref}=\pi_{\theta_{k}}$ (Equation (3)).

– Step 2: Obtain $\theta_{k+1}=\arg\min_{\theta}\mathbb{E}_{s\sim\mathcal{D},i\sim I}\left[\mathrm{KL}\left(q_{k}(\cdot|s,i)\,\|\,\pi_{\theta}(\cdot|s,i)\right)\right]$, under additional regularization (Equation (6)).

Policy Improvement 1: Obtaining Nonparametric Policies
Concretely, we first introduce an intermediate non-parametric policy $q(a|s,i)$ and optimize $J(q)$ while staying close, in expectation, to a reference policy $\pi_{ref}(a|s,i)$
$$\begin{aligned} \max_{q}J(q) &= \mathbb{E}_{i\sim I}\left[\mathbb{E}_{q,s\sim\mathcal{D}}\left[\widehat{Q}(s,a,i)\right]\right], \\ \text{s.t.}\quad & \mathbb{E}_{s\sim\mathcal{D},i\sim I}\left[\mathrm{KL}\left(q(\cdot|s,i)\,\|\,\pi_{ref}(\cdot|s,i)\right)\right]\le\epsilon, \end{aligned}$$  (3)
where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence, $\epsilon$ defines a bound on the KL, and $\mathcal{D}$ denotes the data contained in a replay buffer.
We find the intermediate policy $q$ by maximizing Equation (3) and can obtain a closed-form solution with a non-parametric policy for each task, as
$$q_{k}(a|s,i)\propto\pi_{\theta_{k}}(a|s,i)\exp\left(\frac{\widehat{Q}(s,a,i)}{\eta}\right),$$  (4)
where $\eta$ is a temperature parameter (corresponding to a given bound $\epsilon$) that is obtained by optimizing the dual function,
$$g(\eta)=\eta\epsilon+\eta\,\mathbb{E}_{s\sim\mathcal{D},i\sim I}\left[\log\int\pi_{\theta_{k}}(a|s,i)\exp\left(\frac{\widehat{Q}(s,a,i)}{\eta}\right)\mathrm{d}a\right],$$  (5)
(see Appendix A-A1 for a detailed derivation of the dual function). This policy representation is independent of the form of the parametric policy $\pi_{\theta_{k}}$; i.e. $q$ only depends on $\pi_{\theta_{k}}$ through its requirement for obtaining samples. This, crucially, makes it easy to employ complicated structured policies (such as the one introduced in Section III-A). The only requirement here, and in the following steps, is that we must be able to sample from $\pi_{\theta_{k}}$ and calculate the gradient (w.r.t. $\theta_{k}$) of its log density (but the sampling process itself need not be differentiable).
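A sample-based reading of Equations (4) and (5) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation. The integral over actions is replaced by an average over $N_s$ actions sampled from $\pi_{\theta_k}$, the Q-values are made-up numbers, and the temperature is found with a simple ternary search (relying on the dual being convex in $\eta$) instead of the gradient steps used in the algorithm listing. All function names are hypothetical.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def dual(eta, q_values, epsilon):
    """Sample estimate of g(eta), Equation (5).

    q_values[s][j] holds Q_hat(s, a_j, i) for a_j ~ pi_theta_k(.|s, i)."""
    per_state = [log_mean_exp([q / eta for q in qs]) for qs in q_values]
    return eta * epsilon + eta * sum(per_state) / len(per_state)

def nonparametric_weights(eta, qs):
    """Normalized weights of q_k on the sampled actions of one state (Eq. (4))."""
    top = max(qs)
    unnorm = [math.exp((q - top) / eta) for q in qs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def minimize_dual(q_values, epsilon, lo=1e-3, hi=1e3, iters=100):
    """Ternary search over log(eta); assumes the minimum lies in [lo, hi]."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if dual(math.exp(m1), q_values, epsilon) < dual(math.exp(m2), q_values, epsilon):
            b = m2
        else:
            a = m1
    return math.exp(0.5 * (a + b))

# Made-up Q-values: two states, three sampled actions each.
q_values = [[0.0, 1.0, 2.0], [1.0, 1.0, 3.0]]
epsilon = 0.1
eta = minimize_dual(q_values, epsilon)
weights = [nonparametric_weights(eta, qs) for qs in q_values]
```

At the minimizer of the dual, the average KL divergence between the reweighted samples and the uniform sample weights equals $\epsilon$, so a smaller $\epsilon$ yields a larger temperature and weights closer to uniform.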
Policy Improvement 2: Fitting Parametric Policies
In the second step we fit a policy to the non-parametric distribution obtained from the previous calculation by minimizing the divergence $\mathbb{E}_{s\sim\mathcal{D},i\sim I}\left[\mathrm{KL}\left(q_{k}(\cdot|s,i)\,\|\,\pi_{\theta}(\cdot|s,i)\right)\right]$. Assuming that we can sample from $q_{k}$, this step corresponds to maximum likelihood estimation (MLE). Furthermore, we introduce a trust-region constraint on policy updates. In this way, we can regularize towards a target policy, effectively mitigating optimization instabilities. Trust-region constraints have been used in on- and off-policy RL [42, 2]. We adapt the formulation of [2] to our hierarchical setting, and as the analysis in Section IV-A shows, it is critical for the success of our algorithm. Formally, we aim to obtain the solution in Equation 6, where $\epsilon_{m}$ defines a bound on the change of the new policy.
Here, we drop constant terms and the negative sign in the second line (turning $\min$ into $\max$), and explicitly insert the definition $\pi_{\theta}(a|s,i)=\sum_{o=1}^{M}\pi^{L}(a|s,o)\,\pi^{H}(o|s,i)$, highlighting that we are marginalizing over the high-level choices in this fitting step. The update is independent of the specific policy component from which the action was sampled, enabling joint updates of all components. This reduces the variance of the update and also enables efficient off-policy learning.
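$$\begin{aligned} \theta_{k+1} &= \arg\min_{\theta}\ \mathbb{E}_{s\sim\mathcal{D},i\sim I}\left[\mathrm{KL}\left(q_{k}(\cdot|s,i)\,\|\,\pi_{\theta}(\cdot|s,i)\right)\right] \\ &= \arg\max_{\theta}\ \mathbb{E}_{s\sim\mathcal{D},i\sim I}\left[\mathbb{E}_{a\sim q_{k}}\left[\log\sum_{o=1}^{M}\pi_{\theta}^{L}(a|s,o)\,\pi_{\theta}^{H}(o|s,i)\right]\right], \\ \text{s.t.}\quad & \mathbb{E}_{s\sim\mathcal{D},i\sim I}\left[\mathcal{T}\left(\pi_{\theta_{k}}(\cdot|s,i)\,\|\,\pi_{\theta}(\cdot|s,i)\right)\right] < \epsilon_{m}, \\ \mathcal{T}(\cdot\,\|\,\cdot) &= \mathcal{T}_{H}(s,i)+\mathcal{T}_{L}(s), \\ \mathcal{T}_{H}(s,i) &= \mathrm{KL}\left(\mathrm{Cat}\left(\{\alpha_{\theta_{k}}^{j}(s,i)\}_{j=1}^{M}\right)\,\|\,\mathrm{Cat}\left(\{\alpha_{\theta}^{j}(s,i)\}_{j=1}^{M}\right)\right), \\ \mathcal{T}_{L}(s) &= \frac{1}{M}\sum_{j=1}^{M}\mathrm{KL}\left(\mathcal{N}\left(\mu_{\theta_{k}}^{j}(s),\Sigma_{\theta_{k}}^{j}(s)\right)\,\|\,\mathcal{N}\left(\mu_{\theta}^{j}(s),\Sigma_{\theta}^{j}(s)\right)\right). \end{aligned}$$  (6)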
Different approaches can be used to ensure that both the high-level categorical choices and the action choices change slowly throughout learning. The average KL constraint in Equation (6) is similar in nature to an upper bound on the computationally intractable KL divergence between the two mixture distributions and has been determined experimentally to perform better in practice than simple bounds. In practice, in order to control the change of the high-level and low-level policies independently, we decouple the constraints so that we can set different $\epsilon$ for the means ($\epsilon_{\mu}$), covariances ($\epsilon_{\Sigma}$) and the categorical distribution ($\epsilon_{\alpha}$) in the case of a mixture-of-Gaussians policy. To solve Equation (6), we first employ Lagrangian relaxation to make it amenable to gradient-based optimization and then perform a fixed number of gradient descent steps (using Adam [25]); a detailed overview can be found in Algorithm IIIB, with further information in Appendix A-A2.
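One possible reading of the average KL regularizer discussed above is sketched below. It is an assumption-laden illustration rather than the authors' code. It assumes diagonal covariances and computes a high-level term (KL between the old and new categorical distributions) and a low-level term (KL between corresponding Gaussian components, averaged over the $M$ components). It does not further split the Gaussian term into separate mean and covariance parts as the decoupled $\epsilon_{\mu}$ and $\epsilon_{\Sigma}$ would require. The dictionary layout and function names are hypothetical.

```python
import math

def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions over the same components."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_diag_gaussian(mean_p, std_p, mean_q, std_q):
    """KL(N(mean_p, std_p^2) || N(mean_q, std_q^2)) for diagonal Gaussians."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q)
    )

def mixture_regularizer(old, new):
    """Return (high-level KL, average low-level KL) between two mixtures."""
    t_high = kl_categorical(old["weights"], new["weights"])
    m = len(old["means"])
    t_low = sum(
        kl_diag_gaussian(old["means"][j], old["stds"][j],
                         new["means"][j], new["stds"][j])
        for j in range(m)
    ) / m
    return t_high, t_low

# Illustrative old and new policies at a single state (made-up numbers).
old = {"weights": [0.5, 0.5], "means": [[0.0], [1.0]], "stds": [[1.0], [1.0]]}
new = {"weights": [0.25, 0.75], "means": [[1.0], [1.0]], "stds": [[1.0], [1.0]]}
t_high, t_low = mixture_regularizer(old, new)
```

Keeping the two terms separate is what makes it possible to bound the change of the switching controller and of the components with different thresholds.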
Policy Evaluation
For data-efficient off-policy learning of $\widehat{Q}$ we employ experience sharing across tasks and switching between tasks within one episode for improved exploration by adapting the initial state distribution of each task based on other tasks [38].
Formally, we assume access to a replay buffer containing data gathered from all tasks. For each trajectory snippet $\tau=\{(s_{0},a_{0},R_{0}),\dots,(s_{L},a_{L},R_{L})\}$ we record the rewards for all tasks $R_{t}=[r_{i_{1}}(s_{t},a_{t}),\dots,r_{i_{|I|}}(s_{t},a_{t})]$ as a vector in the buffer. Using this data we define the retrace objective for learning $\widehat{Q}$, parameterized via $\varphi$, following [31, 38] as
$$\min_{\varphi}L(\varphi)=\sum_{i\sim I}\mathbb{E}_{\tau\sim\mathcal{D}}\left[\left(r_{i}(s_{t},a_{t})+\gamma Q^{ret}(s_{t+1},a_{t+1},i)-\widehat{Q}_{\varphi}(s_{t},a_{t},i)\right)^{2}\right],$$  (7)
where $Q^{ret}$ is the L-step retrace target [31]; see Appendix A-B2 for details.
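The following sketch shows how one stored trajectory snippet with a reward vector per step can provide regression targets for every task. It uses the standard Retrace recursion from the literature [31] with truncated importance weights $c_t=\lambda\min(1,\pi/b)$. The exact variant used for Equation (7) is given in the appendix and may differ, so this should be read as an assumption. All names and numbers are hypothetical; `q_taken`, `v_next` and `ratios` stand in for target-network evaluations and policy-to-behaviour density ratios.

```python
def retrace_targets(rewards, q_taken, v_next, ratios, gamma, lam=1.0):
    """Standard Retrace targets for one task along one trajectory snippet.

    rewards[t]: r_i(s_t, a_t)
    q_taken[t]: target-network value Q'(s_t, a_t, i)
    v_next[t]:  expectation of Q'(s_{t+1}, a, i) under a ~ pi(.|s_{t+1}, i)
    ratios[t]:  pi(a_t|s_t, i) / b(a_t|s_t), with b the behaviour policy
    """
    n = len(rewards)
    targets = [0.0] * n
    correction = 0.0  # accumulated off-policy correction from later steps
    for t in reversed(range(n)):
        delta = rewards[t] + gamma * v_next[t] - q_taken[t]
        c_next = lam * min(1.0, ratios[t + 1]) if t + 1 < n else 0.0
        correction = delta + gamma * c_next * correction
        targets[t] = q_taken[t] + correction
    return targets

def per_task_targets(reward_vectors, q_taken, v_next, ratios, gamma):
    """Reuse one snippet for all tasks by indexing the stored reward vector R_t."""
    num_tasks = len(reward_vectors[0])
    return [
        retrace_targets([r[i] for r in reward_vectors],
                        q_taken[i], v_next[i], ratios[i], gamma)
        for i in range(num_tasks)
    ]

# One snippet of length 3 with rewards for two tasks (made-up numbers).
reward_vectors = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
q_taken = [[0.5, 0.2, 0.1], [0.3, 0.3, 0.4]]
v_next = [[0.2, 0.1, 0.0], [0.3, 0.4, 0.0]]
ratios = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
targets = per_task_targets(reward_vectors, q_taken, v_next, ratios, gamma=1.0)
```

With all ratios at least one the recursion accumulates the full multi-step return, and with ratios of zero it falls back to a one-step target, which is how the data of one task can be used safely for another.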
Input: $N_{\text{steps}}$ number of update steps, $N_{\text{targetUpdate}}$ update steps between target updates, $N_{s}$ number of action samples per state, KL regularization parameters $\epsilon$, initial parameters for $\pi$, $\eta$ and $\varphi$
initialize $k=0$
while $k\le N_{\text{steps}}$ do
    for $n$ in $[0\dots N_{\text{targetUpdate}}]$ do
        sample a batch of trajectories $\tau$ from replay buffer $B$
        sample $N_{s}$ actions from $\pi_{\theta_{k}}$ to estimate expectations below
        // compute mean gradients over batch for policy, Lagrangian multipliers and Q-function
        $\delta_{\pi}\leftarrow\nabla_{\theta}\sum_{s_{t}\in\tau}\sum_{j=1}^{N_{s}}\left[\exp\left(\frac{Q(s_{t},a_{j},i)}{\eta}\right)\log\pi_{\theta}(a_{j}|s_{t},i)\right]$ following Eq. 6
        $\delta_{\eta}\leftarrow\nabla_{\eta}g(\eta)=\nabla_{\eta}\left(\eta\epsilon+\eta\sum_{s_{t}\in\tau}\log\frac{1}{N_{s}}\sum_{j=1}^{N_{s}}\left[\exp\left(\frac{Q(s_{t},a_{j},i)}{\eta}\right)\right]\right)$ following Eq. 5
        $\delta_{Q}\leftarrow\nabla_{\varphi}\sum_{i\sim I}\sum_{(s_{t},a_{t})\in\tau}\left(\widehat{Q}_{\varphi}(s_{t},a_{t},i)-Q^{\text{ret}}\right)^{2}$ with $Q^{\text{ret}}$ following Eq. 7
        // apply gradient updates
        $\pi_{\theta_{k+1}}=$ optimizer_update($\pi$, $\delta_{\pi}$)
        $\eta=$ optimizer_update($\eta$, $\delta_{\eta}$)
        $\widehat{Q}_{\varphi}=$ optimizer_update($\widehat{Q}_{\varphi}$, $\delta_{Q}$)
        $k=k+1$
    end for
    // update target networks
    $\pi^{\prime}=\pi$, $Q^{\prime}=Q$
end while
IV Experiments
In the following sections, we investigate the effects of training hierarchical policies in single-task and multitask domains. In particular, we demonstrate that RHPO can provide compelling benefits for multitask learning in real and simulated robotic manipulation tasks and significantly reduce platform interaction time. For the final experiment, a stacking task on a physical Sawyer robot arm, RHPO achieves a dramatic performance improvement after a week of training compared to several strong baselines. We further investigate RHPO in a sequential transfer setting and find that when pretrained skills (i.e. low-level components) are available RHPO can provide additional improvements in data efficiency.
Finally, we perform a number of ablations to emphasize the importance of trust-region constraints for the high-level controller and to understand the relative role of hierarchy in the single-task and multitask setting: In the single-task case, using domains from the DeepMind Control Suite [49], we first demonstrate that our hierarchy on its own can fail to improve performance and that for the model to exploit compositionality in this setting, additional incentives for component specialization are required.
For all tasks and algorithms, we use a distributed actor-critic framework (similar to [12]) with flexible hardware assignment [8]. We perform critic and policy updates from a replay buffer, which is asynchronously filled by a set of actors. In all figures with error bars, we visualize mean and variance derived from 3 runs. Additional details of task hyperparameters as well as results for ablations and the full set of tasks from the multitask domains are provided in Appendix A-D. Additional details and the appendix can be found under https://sites.google.com/corp/view/rhpo
IV-A Simulated Multitask Experiments
We use three simulated multitask scenarios with the Kinova Jaco and Rethink Robotics Sawyer robot arms to test in a variety of conditions. The three scenarios each consist of tasks of different difficulties and vary in their overall complexity. The least difficult scenario is Pile1: Here, the seven tasks of interest range from simple reaching for a block over tasks like grasping it, to the final task of stacking the block on top of another block. The two more difficult scenarios are Pile2 and Cleanup2. Pile2 includes stacking with both blocks on top of the respective other block, resulting in a setting with 10 tasks. Cleanup2 includes harder tasks such as opening a box and placing blocks into this box, consisting of a total of 13 tasks. In addition to the experiments in simulation, which are executed with 5 actors in a distributed setting, we also investigate the Pile1 multitask domain (same rewards and setup) on a single, physical robot in Section IV-B.
Our main comparison evaluates RHPO with hierarchical policies against SAC [38] with a flat, monolithic policy shared across all tasks, which is provided with the additional task id as input (displayed as SAC-U-Monolithic), as well as policies with task-dependent heads (displayed as SAC-U-Independent) following [38], both using MPO as the optimizer. Furthermore, we compare against a re-implementation of SAC using SVG [20] as an actor-critic based optimizer which uses the reparameterization trick (displayed as SAC-U[SVG]). In order to compare with gradient-based hierarchical policy updates (such as option-critic [6]) and to investigate the application of the proposed hierarchical model to other RL algorithms, we also use SVG (with a continuous relaxation of the categorical distribution [27, 24]) instead of MPO to optimize the hierarchical model; results are included in the Pile1 experiments (displayed as RHPO[SVG]), with additional results in Appendix A-H. These comparisons furthermore support our choice of a critic-weighted likelihood over a reparameterization-gradient-based policy optimizer.
The main SAC baselines provide the two opposite, naive perspectives on transfer: using the same monolithic policy across tasks enables positive as well as negative interference, whereas independent policies prevent policy-based transfer. After experimentally confirming the robustness of RHPO with respect to the number of low-level sub-policies (see Appendix A-G3), we set $M$ proportional to the number of tasks in each domain.
Figure 3 demonstrates that RHPO outperforms the monolithic as well as the independent baselines (based on SAC). For simple tasks such as the Pile1 domain, the difference is small, but as the number of tasks grows and the complexity of the domain increases (cf. Pile2 and Cleanup2), the advantage of composing learned behaviours across tasks becomes more significant. We further observe that using MPO instead of SVG [20] as policy optimizer results in an improvement for the baselines. This effect becomes more pronounced for the hierarchical policies.
IV-B Physical Robot Experiments
For real-world experiments, data-efficiency is crucial. We perform all experiments in this section relying on a single robot (single actor), demonstrating the benefits of RHPO in the low-data regime. The performed task is the real-world version of the Pile1 task described in Section IV-A. Given the higher cost of experiment time, the robot experiments additionally emphasize the need for algorithms that are robust to hyperparameter settings, which is further investigated in Section IV-E.
The setup for the experiments consists of a Sawyer robot arm mounted on a table, equipped with a Robotiq 2F-85 parallel gripper. A basket of size $20\,\text{cm}^{2}$ in front of the robot contains the three cubes. Three cameras on the basket track the cubes using fiducials (augmented reality tags). As in simulation, the agent is provided with proprioception information (joint positions, velocities and torques), a wrist sensor's force and torque readings, as well as the cubes' poses, estimated via the fiducials. The agent's action is five-dimensional and consists of the three Cartesian translational velocities, the angular velocity of the wrist around the vertical axis and the speed of the gripper's fingers.
Figure 4 plots the learning progress on the real robot for four of the tasks, from the simpler reach and lift tasks to the stack task and the final stack-and-leave task, which is the main task of interest. Plots for the learning progress of all tasks are given in Appendix A-F. As can be observed, all methods manage to learn the reach task quickly (within a few thousand episodes), but only RHPO with a hierarchical policy is able to learn the stacking task (taking about 15 thousand episodes to obtain good stacking success), which takes about 8 days of training on the real robot. Progress is considerably slower for all baselines, which take multiple weeks for completion.
To provide further insight into the learned representation we compute distributions for each component over the tasks which activate it, as well as distributions for each task over which components are being used. For each set of distributions, we compute the Bhattacharyya distance to quantify the similarity between tasks and the similarity between components in Figure 5 (right). The plots demonstrate how the components specialize, but also provide a way to investigate our tasks, showing e.g. that the first reach task is fairly independent and that the last four tasks are comparably similar regarding the high-level components applied for their solution.
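The similarity analysis described above can be sketched in a few lines. The usage counts below are invented for illustration and do not come from the experiments, and the names `usage`, `normalize` and `bhattacharyya_distance` are hypothetical. Rows are normalized to obtain per-task distributions over components, columns to obtain per-component distributions over tasks, and the Bhattacharyya distance $-\log\sum_{x}\sqrt{p(x)q(x)}$ is computed for every pair.

```python
import math

def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = float(sum(counts))
    return [c / total for c in counts]

def bhattacharyya_distance(p, q):
    """-log of the Bhattacharyya coefficient; infinite for disjoint supports."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient <= 0.0:
        return math.inf
    return max(0.0, -math.log(coefficient))  # clamp tiny negative rounding errors

# usage[task][component]: how often each task selected each component (made up).
usage = [[90, 5, 5], [10, 60, 30], [5, 55, 40]]

task_dists = [normalize(row) for row in usage]
component_dists = [normalize(col) for col in zip(*usage)]

task_distance = [[bhattacharyya_distance(p, q) for q in task_dists]
                 for p in task_dists]
component_distance = [[bhattacharyya_distance(p, q) for q in component_dists]
                      for p in component_dists]
```

In this toy example the second and third tasks rely on similar components and therefore end up close to each other, while the first task stays far from both.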
[Figure 5 (right), panels: Task Similarity, Component Similarity]
IV-C Sequential Transfer
RHPO is well suited for sequential transfer learning as it allows pretrained low-level components to be used to solve new tasks. To investigate performance in adapting pretrained multitask policies to novel tasks, we train agents to fulfill all but the final task in the Pile1 and Cleanup2 domains and subsequently evaluate training the models on the final task. We consider two settings for the final policy: we either introduce only a new high-level controller (Sequential-Only-HL) or both an additional shared component as well as a new high-level controller (Sequential). Figure 6 shows that in the sequential transfer setting, starting from a policy trained on a set of related tasks results in up to 5 times higher data-efficiency in terms of actor episodes on the final task than training the same policy from scratch. We observe that the final task can be solved by only reusing low-level components from previous tasks if the final task is the composition of previous tasks. This is the case for the final task in Cleanup2, which can be completed by sequencing the previously learned components, in contrast to Pile1, where the final letting go of the block after stacking is not required for earlier tasks.
IV-D Simulated Single Task Experiments
We consider two high-dimensional tasks for continuous control, humanoid-run and humanoid-stand from Tassa et al. [49], and compare MPO with a flat Gaussian policy to RHPO with a mixture of Gaussians with three components. Figure 7 shows the results in terms of the number of episodes.
When both the flat and hierarchical policies are initialized with means close to zero, RHPO performs comparably to a flat policy and learns similar means and variances for all components, as the model fails to decompose the learned behavior. If, however, the hierarchical policy is initialized with distinct means for different components (here, for the three components ranging for all dimensions from the minimum to the maximum of the allowed action range, i.e. $[-1, 1]$), we observe significantly improved performance and component specialization.
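A small sketch of such an initialization follows. It assumes the component means are spaced evenly between the action bounds and shared across all action dimensions, which is one plausible reading of the description above and not a confirmed detail. The helper name `initial_component_means` and the action dimension used in the example are hypothetical.

```python
def initial_component_means(num_components, action_dim, low=-1.0, high=1.0):
    """Evenly spaced initial means, identical across action dimensions."""
    if num_components == 1:
        return [[0.5 * (low + high)] * action_dim]
    step = (high - low) / (num_components - 1)
    return [[low + j * step] * action_dim for j in range(num_components)]

# Three components in an arbitrary 4-dimensional action space.
means = initial_component_means(num_components=3, action_dim=4)
```

For three components this yields means at the lower bound, the centre and the upper bound of the action range, so the components start out covering different behaviours instead of collapsing onto the same one.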
IV-E Performance Ablations
We perform a series of ablations based on the earlier introduced Pile1 domain, providing additional insights into benefits, shortcomings and relevant hyperparameters of RHPO.
First, we display the importance of the choice of regularization in Figure 8, with complete results in Appendix A-G1. We demonstrate the effect of weakening the constraint by setting the epsilon value higher (here: to 1.). This setting prevents convergence of the policy to capable solutions and emphasizes the necessity of constraining the update steps. In addition, very small values can slow down convergence. However, in the present experiments a range of about 2 orders of magnitude results in good performance.
We additionally ablate over the number of data-generating actors in Figure 9 to evaluate all approaches with respect to data rate and illustrate how RHPO is particularly relevant at lower data rates such as those given by real-world robotics applications (with results for all tasks in Appendix A-G2). Here, RHPO always provides stronger final performance and learns significantly faster in the one-actor case, as is common for robot experiments. By running with multiple actors, we increase the rate of data generation such that in asynchronous settings, the speed of the learner becomes more important. Since training our hierarchical policies is computationally slightly more costly, the benefits become smaller for easier tasks. (In asynchronous RL systems, the update rate of the learner can have a significant impact on the performance when evaluated over data generated.)
[Figure panels: Reach, Lift, Stack]
V Related Work
Transfer learning, in particular in the multitask context, has long been part of machine learning (ML) for data-limited domains [9, 53, 35, 50]. Commonly, it is not straightforward to train a single model jointly across different tasks, as the solutions to tasks might interfere not only positively but also negatively [56]. Preventing this type of forgetting or negative transfer presents a challenge for biological [45] as well as artificial systems [14]. In the context of ML, a common scheme is the reduction of representational overlap [14, 41, 56]. Bishop [7] utilizes neural networks to parametrize mixture models for representing multimodal distributions, thus mitigating shortcomings of non-hierarchical approaches. Rosenstein et al. [40] demonstrate the benefits of hierarchical classification models to limit the impact of negative transfer.
Hierarchical approaches have a long history in the reinforcement learning literature [e.g. 48, 11]. Prior work commonly benefits from combining hierarchy with additional inductive biases, such as [54, 33, 32, 58], which employ different rewards for different levels of the hierarchy rather than optimizing a single objective for the entire model as we do. Other works have shown additional benefits for the stability of training and data-efficiency when sequences of high-level actions are given as guidance during optimization in a hierarchical setting [43, 4, 52]. Instead of introducing additional training signals, we directly investigate the benefits of compositional hierarchy as provided structure for transfer between tasks.
Hierarchical models for probabilistic trajectory modelling have been used for the discovery of behavior abstractions as part of an end-to-end reinforcement learning paradigm [e.g. 51, 22, 52, 15], where the models act as learned inductive biases that induce the sharing of behavior across tasks. In a vein similar to the presented algorithm, [e.g. 21, 52] share a low-level controller across tasks but modulate the low-level behavior via a continuous embedding rather than picking from a small number of mixture components. In related work, [19, 16] learn hierarchical policies with continuous latent variables, optimizing the entropy-regularized objective.
Similar to our work, the options framework [48, 36] supports behavior hierarchies, where the higher level chooses from a discrete set of sub-policies or “options”, which commonly are run until a termination criterion is satisfied. The framework focuses on the notion of temporal abstraction. A number of works have proposed practical and scalable algorithms for learning option policies with reinforcement learning [e.g. 6, 59, 46, 39, 17] or criteria for option induction [e.g. 17, 18]. Rather than the additional inductive bias of temporal abstraction, we focus on the investigation of composition as a type of hierarchy in the context of single and multitask learning, while demonstrating that the strength of hierarchical composition lies in domains with strong variation in the objectives, such as multitask domains. We additionally introduce a hierarchical extension of SVG [20] to investigate similarities to work on the option-critic [6].
KL regularization has been used to different ends in RL; closely related work to RHPO focuses on contextual bandits [10]. That algorithm builds on a two-step EM-like procedure to optimize linearly parametrized mixture policies. However, it has been used only with low-dimensional policy representations, and in contextual bandit and other very short horizon settings. Our approach is designed to be applicable to full RL problems in complex domains with long horizons and with high-capacity function approximators such as neural networks. This requires robust estimation of value function approximations, off-policy correction, and additional regularization for stable learning.
VI Discussion
We introduce RHPO, a novel algorithm for robust training of hierarchical policies in multitask settings. RHPO consistently outperforms competitive baselines which either handle tasks independently or implicitly share experience by reusing data across tasks. Especially for complex tasks or in a low-data regime, as encountered in robotics applications, we strongly reduce the number of environment interactions and improve final performance and learning robustness while reducing sensitivity to hyperparameters. Our results show that the algorithm scales to complex, real-world domains and provides an important step towards the deployment of RL algorithms on robotic systems.
Algorithmically, our method highlights the importance of trust-region-like regularization for stable optimization of hierarchical policies. Furthermore, our update rules, in combination with mixture policies and hindsight reward assignments, enable training for any task and skill independent of the data source. This enables efficient learning of the hierarchical policies in an off-policy setting, which is important for data-efficient learning.
Conceptually, our results demonstrate that hierarchical policies can be an effective way of sharing skills or behavior components across tasks, both in multitask (Sections IV-A and IV-B) and in transfer settings (Section IV-C), and that they partially mitigate negative interference between tasks in the parallel multitask learning scenario. Furthermore, we find that their benefits are complementary to off-policy sharing of transition data across tasks (e.g. SAC-X [38], HER [5]). Valuable directions for future work include the direct extension to multi-level hierarchies and the identification of basis sets of behaviours which perform well on wide ranges of possible tasks given a known domain.
We believe that, especially in domains with consistent agent embodiment and high costs for data generation, learning tasks jointly and sharing information are imperative. Our results suggest that a system that is exposed to a rich set of tasks or experiences and has appropriate means for reusing knowledge can learn to solve non-trivial problems directly from interaction with its environment. RHPO combines several ideas that we believe will be important: sharing data and skills across tasks with compositional policy representations, robust optimization, and efficient off-policy learning. Although we have found this particular combination of components to be very effective, we believe it is just one instance of, and step towards, a spectrum of efficient learning architectures that will unlock further applications of RL both in simulation and, more importantly, on physical hardware.
Acknowledgments
The authors would like to thank Michael Bloesch, Jonas Degrave, Joseph Modayil and Doina Precup for helpful discussion and relevant feedback for shaping our submission. As robotics (and AI research) is a team sport we’d additionally like to acknowledge the support of Francesco Romano, Murilo Martins, Stefano Salicetti, Tom Rothörl and Francesco Nori on the hardware and infrastructure side as well as many others of the DeepMind team.
References
 Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Abdolmaleki et al. [2018a] Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa, Dan Belov, Nicolas Heess, and Martin Riedmiller. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018a.
 Abdolmaleki et al. [2018b] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Rémi Munos, Nicolas Heess, and Martin A. Riedmiller. Maximum a posteriori policy optimisation. CoRR, abs/1806.06920, 2018b.
 Andreas et al. [2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 166–175. JMLR.org, 2017.
 Andrychowicz et al. [2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
 Bacon et al. [2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Bishop [1994] Christopher M Bishop. Mixture density networks. Technical report, Citeseer, 1994.
 Buchlovsky et al. [2019] Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, Fabio Viola, and Dan Belov. Tfreplicator: Distributed machine learning for researchers. arXiv preprint arXiv:1902.00465, 2019.
 Caruana [1997] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
 Daniel et al. [2016] Christian Daniel, Gerhard Neumann, Oliver Kroemer, and Jan Peters. Hierarchical relative entropy policy search. The Journal of Machine Learning Research, 17(1):3190–3239, 2016.
 Dayan and Hinton [1993] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
 Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deeprl with importance weighted actorlearner architectures. In International Conference on Machine Learning, pages 1406–1415, 2018.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
 French [1999] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
 Galashov et al. [2018] Alexandre Galashov, Siddhant M Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M Czarnecki, Yee Whye Teh, Razvan Pascanu, and Nicolas Heess. Information asymmetry in klregularized rl. 2018.
 Haarnoja et al. [2018] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 1846–1855, 2018.
 Harb et al. [2018] Jean Harb, PierreLuc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 Harutyunyan et al. [2019] Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Rémi Munos, and Doina Precup. The termination critic. CoRR, abs/1902.09996, 2019. URL http://arxiv.org/abs/1902.09996.
 Hausman et al. [2018] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.
 Heess et al. [2015] Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.
 Heess et al. [2016] Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, and David Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
 Igl et al. [2019] Maximilian Igl, Andrew Gambardella, Nantas Nardelli, N Siddharth, Wendelin Böhmer, and Shimon Whiteson. Multitask soft option learning. arXiv preprint arXiv:1904.01033, 2019.
 Jaderberg et al. [2016] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144, 2016.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling [2014] Diederik P Kingma and Max Welling. Autoencoding variational bayes. In ICLR, 2014.
 Maddison et al. [2016] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mohamed et al. [2019] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652, 2019.
 Munos et al. [2016] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
 Nachum et al. [2018a] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Nearoptimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018a.
 Nachum et al. [2018b] Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Dataefficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018b.
 OpenAI et al. [2018] OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous inhand manipulation. CoRR, abs/1808.00177, 2018. URL http://arxiv.org/abs/1808.00177.
 Pan and Yang [2010] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 Precup [2000] Doina Precup. Temporal abstraction in reinforcement learning. University of Massachusetts Amherst, 2000.
 Reynolds et al. [2017] Malcolm Reynolds, Gabriel Barth-Maron, Frederic Besse, Diego de Las Casas, Andreas Fidjeland, Tim Green, Adrià Puigdomènech, Sébastien Racanière, Jack Rae, and Fabio Viola. Open sourcing Sonnet: a new library for constructing neural networks. https://deepmind.com/blog/opensourcingsonnet/, 2017.
 Riedmiller et al. [2018] Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.
 Riemer et al. [2018] Matthew Riemer, Miao Liu, and Gerald Tesauro. Learning abstract options. In Advances in Neural Information Processing Systems, pages 10424–10434, 2018.
 Rosenstein et al. [2005] Michael T Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, volume 898, pages 1–4, 2005.
 Rusu et al. [2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 Schulman et al. [2015] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, pages 1889–1897. JMLR.org, 2015.
 Shiarlis et al. [2018] Kyriacos Shiarlis, Markus Wulfmeier, Sasha Salter, Shimon Whiteson, and Ingmar Posner. Taco: Learning task decomposition via temporal alignment for control. In International Conference on Machine Learning, pages 4661–4670, 2018.
 Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Singley and Anderson [1989] Mark K Singley and John Robert Anderson. The transfer of cognitive skill. Number 9. Harvard University Press, 1989.
 Smith et al. [2018] Matthew Smith, Herke Hoof, and Joelle Pineau. An inferencebased policy gradient method for learning options. In International Conference on Machine Learning, pages 4710–4719, 2018.
 Sutton [1988] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
 Tassa et al. [2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
 Taylor and Stone [2009] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
 Teh et al. [2017] Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. CoRR, abs/1707.04175, 2017. URL http://arxiv.org/abs/1707.04175.
 Tirumala et al. [2019] Dhruva Tirumala, Hyeonwoo Noh, Alexandre Galashov, Leonard Hasenclever, Arun Ahuja, Greg Wayne, Razvan Pascanu, Yee Whye Teh, and Nicolas Heess. Exploiting hierarchy for learning and transfer in klregularized rl. arXiv preprint arXiv:1903.07438, 2019.
 Torrey and Shavlik [2010] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pages 242–264. IGI Global, 2010.
 Vezhnevets et al. [2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3540–3549. JMLR.org, 2017.
 Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M. Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Yuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the RealTime Strategy Game StarCraft II. https://deepmind.com/blog/alphastarmasteringrealtimestrategygamestarcraftii/, 2019.
 Wang et al. [2018] Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. arXiv preprint arXiv:1811.09751, 2018.
 Wulfmeier et al. [2017] Markus Wulfmeier, Ingmar Posner, and Pieter Abbeel. Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907, 2017.
 Xie et al. [2018] Saining Xie, Alexandre Galashov, Siqi Liu, Shaobo Hou, Razvan Pascanu, Nicolas Heess, and Yee Whye Teh. Transferring task goals via hierarchical reinforcement learning, 2018. URL https://openreview.net/forum?id=S1Y6TtJvG.
 Zhang and Whiteson [2019] Shangtong Zhang and Shimon Whiteson. Dac: The double actorcritic architecture for learning options. arXiv preprint arXiv:1904.12691, 2019.
Appendix A Appendix
A-A Additional Derivations
In this section we explain the detailed derivations for training hierarchical policies parameterized as a mixture of Gaussians.
A-A1 Obtaining Non-parametric Policies
In each policy improvement step, to obtain nonparametric policies for a given state and task distribution, we solve the following program:
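$$\begin{aligned} \max_{q}\quad & \mathbb{E}_{\mu(s),\,i\sim I}\left[\mathbb{E}_{q(a|s,i)}\left[\widehat{Q}(s,a,i)\right]\right] \\ \text{s.t.}\quad & \mathbb{E}_{\mu(s),\,i\sim I}\left[\mathrm{KL}\left(q(\cdot|s,i)\,\|\,\pi(\cdot|s,i,\theta_t)\right)\right] < \epsilon, \\ \text{s.t.}\quad & \mathbb{E}_{\mu(s),\,i\sim I}\left[\mathbb{E}_{q(a|s,i)}\left[1\right]\right] = 1. \end{aligned}$$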
To make the following derivations easier to follow, we open up the expectations and write them explicitly as integrals. For this purpose, let us define the joint distribution over states $s\sim\mu(s)$ together with randomly sampled tasks $i\sim I$ as $\mu(s,i)=p(s|\mathcal{D})\,\mathcal{U}(i\in I)$, where $\mathcal{U}$ denotes the uniform distribution over possible tasks. This allows us to rewrite the expectations that include the corresponding distributions, i.e. $\mathbb{E}_{\mu(s),\,i\sim I}[1]=\mathbb{E}_{\mu(s,i)}[1]$, but again, note that $i$ here is not necessarily the task under which $s$ was observed. We can then write the Lagrangian corresponding to the above program as
$$\begin{aligned} L(q,\eta,\gamma) ={}& \int \mu(s,i) \int q(a|s,i)\,\widehat{Q}(s,a,i)\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i \\ &+ \eta\left(\epsilon - \int \mu(s,i) \int q(a|s,i)\log\frac{q(a|s,i)}{\pi(a|s,i,\theta_t)}\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i\right) \\ &+ \gamma\left(1 - \iint \mu(s,i)\,q(a|s,i)\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i\right). \end{aligned}$$
Next we maximize the Lagrangian $L$ w.r.t the primal variable $q$. The derivative w.r.t $q$ reads,
$$\partial_q L(q,\eta,\gamma) = \widehat{Q}(s,a,i) - \eta\log q(a|s,i) + \eta\log\pi(a|s,i,\theta_t) - \eta - \gamma.$$
Setting it to zero and rearranging terms we obtain
$$q(a|s,i)=\pi(a|s,i,\theta_t)\exp\left(\frac{\widehat{Q}(s,a,i)}{\eta}\right)\exp\left(-\frac{\eta+\gamma}{\eta}\right).$$
However, the last exponential term is a normalization constant for $q$. Therefore we can write,
$$\exp\left(\frac{\eta+\gamma}{\eta}\right)=\int \pi(a|s,i,\theta_t)\exp\left(\frac{Q(s,a,i)}{\eta}\right)\mathrm{d}a,$$
$$\frac{\eta+\gamma}{\eta}=\log\left(\int \pi(a|s,i,\theta_t)\exp\left(\frac{Q(s,a,i)}{\eta}\right)\mathrm{d}a\right). \tag{8}$$
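To make the closed-form solution concrete, the following Python sketch evaluates the non-parametric policy on a finite set of actions sampled from the current policy, in which case $q$ reduces to self-normalized weights proportional to $\exp(Q/\eta)$. The function name `nonparametric_weights` and the example values are hypothetical, and the sample-based normalization is an approximation of the integral in Equation 8.

```python
import math


def nonparametric_weights(q_values, eta):
    """Self-normalized weights proportional to exp(Q / eta).

    `q_values` holds Q-estimates for actions sampled from the current policy
    pi(.|s, i), so the prior pi is already accounted for by the sampling.
    """
    scaled = [q / eta for q in q_values]
    shift = max(scaled)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(x - shift) for x in scaled]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Example: three sampled actions with increasing Q-values.
weights = nonparametric_weights([1.0, 2.0, 3.0], eta=1.0)
```

A smaller temperature $\eta$ concentrates the weights on the highest-valued samples, while a larger $\eta$ keeps $q$ close to the sampling policy.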
Now, to obtain the dual function $g(\eta )$, we plug in the solution to the KL constraint term (second term) of the Lagrangian which yields
$$\begin{aligned} L(q,\eta,\gamma) ={}& \int \mu(s,i)\int q(a|s,i)\,Q(s,a,i)\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i \\ &+ \eta\left(\epsilon - \int \mu(s,i)\int q(a|s,i)\log\frac{\pi(a|s,i,\theta_t)\exp\left(\frac{Q(s,a,i)}{\eta}\right)\exp\left(-\frac{\eta+\gamma}{\eta}\right)}{\pi(a|s,i,\theta_t)}\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i\right) \\ &+ \gamma\left(1 - \iint \mu(s,i)\,q(a|s,i)\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i\right). \end{aligned}$$
After expanding and rearranging terms we get
$$\begin{aligned} L(q,\eta,\gamma) ={}& \int \mu(s,i)\int q(a|s,i)\,Q(s,a,i)\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i \\ &- \eta\int \mu(s,i)\int q(a|s,i)\left[\frac{Q(s,a,i)}{\eta} + \log\pi(a|s,i;\theta_t) - \frac{\eta+\gamma}{\eta}\right]\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i \\ &+ \eta\epsilon + \eta\int \mu(s,i)\int q(a|s,i)\log\pi(a|s,i;\theta_t)\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i \\ &+ \gamma\left(1 - \iint \mu(s,i)\,q(a|s,i)\,\mathrm{d}a\,\mathrm{d}s\,\mathrm{d}i\right). \end{aligned}$$
Most of the terms cancel out and after rearranging the terms we obtain
$$L(q,\eta,\gamma)=\eta\epsilon+\eta\int \mu(s,i)\,\frac{\eta+\gamma}{\eta}\,\mathrm{d}s\,\mathrm{d}i.$$
Note that we have already calculated the term inside the integral in Equation 8. By plugging in equation 8 we obtain the dual function
$$g(\eta)=\eta\epsilon+\eta\int \mu(s,i)\log\left(\int \pi(a|s,i,\theta_t)\exp\left(\frac{Q(s,a,i)}{\eta}\right)\mathrm{d}a\right)\mathrm{d}s\,\mathrm{d}i, \tag{9}$$
which we can minimize with respect to $\eta $ based on samples from the replay buffer.
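As a rough illustration of how the dual in Equation 9 can be estimated from samples, the Python sketch below replaces the outer integral with an average over a batch of (state, task) pairs and the inner integral with an average over actions sampled from the current policy. The function names, the golden-section search, and its bracket `[low, high]` are assumptions for this sketch; gradient steps on $\eta$ are used in the pseudocode later in this appendix, and the search is swapped in here only because $g$ is convex in $\eta$ and a derivative-free method keeps the example short.

```python
import math


def dual(eta, epsilon, q_samples):
    """Sample-based estimate of g(eta) from Equation 9.

    `q_samples[n]` lists Q-values of actions drawn from pi(.|s_n, i_n) for
    the n-th (state, task) pair in a batch.
    """
    total = 0.0
    for q_values in q_samples:
        shift = max(q_values)
        mean_exp = sum(math.exp((q - shift) / eta) for q in q_values) / len(q_values)
        total += shift / eta + math.log(mean_exp)  # stable log-mean-exp
    return eta * epsilon + eta * total / len(q_samples)


def minimize_dual(epsilon, q_samples, low=1e-3, high=100.0, iters=200):
    """Golden-section search for the minimizing eta (assumed bracket)."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        a = high - ratio * (high - low)
        b = low + ratio * (high - low)
        if dual(a, epsilon, q_samples) < dual(b, epsilon, q_samples):
            high = b
        else:
            low = a
    return 0.5 * (low + high)


# Example: one state with two sampled actions.
eta_star = minimize_dual(epsilon=0.1, q_samples=[[0.0, 1.0]])
```

When all sampled Q-values for a state are equal, the log-mean-exp term reduces to that value divided by $\eta$, so the estimate becomes $\eta\epsilon$ plus the mean Q-value, which is a convenient sanity check.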
A-A2 Extended Update Rules For Fitting a Mixture of Gaussians
After obtaining the non-parametric policies, we fit a parametric policy to samples from them, effectively employing maximum likelihood estimation with additional regularization based on a distance function $\mathcal{T}$, i.e.,
$$\begin{aligned} \theta^{(k+1)} &= \arg\min_{\theta}\ \mathbb{E}_{s\sim\mathcal{D},\,i\sim I}\left[\mathrm{KL}\left(q_k(\cdot|s,i)\,\|\,\pi_\theta(\cdot|s,i)\right)\right] \\ &= \arg\min_{\theta}\ \mathbb{E}_{s\sim\mathcal{D},\,i\sim I,\,a\sim q(\cdot|s,i)}\left[-\log\pi_\theta(a|s,i)\right] \\ &= \arg\max_{\theta}\ \mathbb{E}_{s\sim\mathcal{D},\,i\sim I,\,a\sim q(\cdot|s,i)}\left[\log\pi_\theta(a|s,i)\right], \\ &\quad\text{s.t.}\ \ \mathbb{E}_{s\sim\mathcal{D},\,i\sim I}\left[\mathcal{T}\left(\pi_{\theta_k}(\cdot|s,i),\pi_\theta(\cdot|s,i)\right)\right] < \epsilon, \end{aligned} \tag{10}$$
where $\mathcal{T}$ is an arbitrary distance function to evaluate the change of the new policy with respect to a reference/old policy, and $\epsilon$ denotes the allowed change for the policy.
To make the above objective amenable to gradient based optimization we employ Lagrangian Relaxation, yielding the following primal:
$$\max_{\theta}\min_{\alpha>0}\ L(\theta,\alpha)=\mathbb{E}_{s\sim\mathcal{D},\,i\sim I,\,a\sim q(\cdot|s,i)}\left[\log\pi_\theta(a|s,i)\right]+\alpha\left(\epsilon-\mathbb{E}_{s\sim\mathcal{D},\,i\sim I}\left[\mathcal{T}\left(\pi_{\theta_k}(\cdot|s,i),\pi_\theta(\cdot|s,i)\right)\right]\right). \tag{11}$$
We solve for $\theta $ by iterating the inner and outer optimization programs independently: We fix the parameters $\theta $ to their current value and optimize for the Lagrangian multipliers (inner minimization) and then we fix the Lagrangian multipliers to their current value and optimize for $\theta $ (outer maximization). In practice we found that it is effective to simply perform one gradient step each in inner and outer optimization for each sampled batch of data.
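The alternating scheme can be sketched on a toy scalar problem. In the Python snippet below, the quadratic surrogate objective $-(\theta-2)^2$, the squared distance to the old parameters, the step size, and the function name are all hypothetical choices made for illustration; only the structure, one gradient step on the multiplier followed by one gradient step on $\theta$, mirrors the procedure described above.

```python
def alternating_lagrangian_steps(num_steps=5000, lr=0.05, epsilon=0.25):
    """Toy max-min optimization of L(theta, alpha) with alternating steps.

    Assumed toy problem: maximize -(theta - 2)^2 subject to
    (theta - theta_old)^2 < epsilon, with theta_old = 0.
    """
    theta_old = 0.0  # reference parameters theta_k
    theta, alpha = 0.0, 1.0

    for _ in range(num_steps):
        # Inner minimization: one gradient step on alpha with theta fixed.
        distance = (theta - theta_old) ** 2
        alpha = max(alpha - lr * (epsilon - distance), 0.0)

        # Outer maximization: one gradient step on theta with alpha fixed.
        grad_objective = -2.0 * (theta - 2.0)
        grad_distance = 2.0 * (theta - theta_old)
        theta = theta + lr * (grad_objective - alpha * grad_distance)

    return theta, alpha


theta_final, alpha_final = alternating_lagrangian_steps()
```

In this toy setting the unconstrained optimum at $\theta=2$ violates the constraint, so the multiplier grows until the iterate settles on the boundary of the trust region, which is the qualitative behavior the regularization is meant to produce.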
The optimization given above is general, i.e. it works for any general type of policy. As described in the main paper, we consider hierarchical policies of the form
$$\pi_\theta(a|s,i)=\sum_{o=1}^{M}\pi_\theta^{L}(a|s,o)\,\pi_\theta^{H}(o|s,i). \tag{12}$$
In particular, in all experiments we made use of a mixture of Gaussians parametrization, where the high-level policy $\pi_\theta^{H}$ is a categorical distribution over low-level Gaussian policies $\pi_\theta^{L}$, i.e.,
$$\begin{aligned} \pi_\theta(a|s,i) &= \pi\left(a\,\middle|\,\{\alpha_\theta^{j},\mu_\theta^{j},\Sigma_\theta^{j}\}(s)_{j=1\dots M}\right) \\ &= \sum_{j=1}^{M}\alpha_\theta^{j}(s,i)\,\mathcal{N}\left(\mu_\theta^{j}(s),\Sigma_\theta^{j}(s)\right), \\ &\forall s:\ \sum_{j=1}^{M}\alpha^{j}(s,i)=1\ \text{ and }\ \alpha^{j}(s,i)>0, \end{aligned}$$
where $j$ denotes the component index, $\alpha$ is the high-level policy $\pi^{H}$ assigning probabilities to each mixture component for a state $s$ given the task, and the low-level policies are all Gaussian. The $\alpha^{j}$ are the probabilities of a categorical distribution over the components.
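A minimal Python sketch of such a mixture policy, for a single state and task, is given below. It assumes diagonal covariances represented by per-dimension standard deviations and takes the categorical weights, means, and standard deviations as plain lists; in practice these quantities would be network outputs, and all function names here are hypothetical.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian evaluated at `action`."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def mixture_log_prob(action, weights, means, stds):
    """log pi(a|s,i) = log sum_j alpha^j N(a; mu^j, Sigma^j), alpha^j > 0."""
    terms = [
        math.log(w) + gaussian_log_prob(action, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    shift = max(terms)  # log-sum-exp for numerical stability
    return shift + math.log(sum(math.exp(t - shift) for t in terms))


def sample_action(weights, means, stds, rng=random):
    """Pick a component from the categorical, then sample its Gaussian."""
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, s) for m, s in zip(means[j], stds[j])]


# Example: two components in a two-dimensional action space.
action = sample_action(
    [0.5, 0.5],
    [[-1.0, -1.0], [1.0, 1.0]],
    [[0.1, 0.1], [0.1, 0.1]],
    rng=random.Random(0),
)
```

Only the weights depend on the task, so the same component means and standard deviations can be reused by every task while each task chooses its own mixing proportions.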
We also define the following distance function between old and new mixture of Gaussian policies
$$\mathcal{T}\left(\pi_{\theta_k}(\cdot|s,i),\pi_\theta(\cdot|s,i)\right)=\mathcal{T}_{H}(s,i)+\mathcal{T}_{L}(s)$$
$$\mathcal{T}_{H}(s,i)=\mathrm{KL}\left(\mathrm{Cat}\left(\{\alpha_{\theta_k}^{j}(s,i)\}_{j=1\dots M}\right)\,\middle\|\,\mathrm{Cat}\left(\{\alpha_{\theta}^{j}(s,i)\}_{j=1\dots M}\right)\right)$$
$$\mathcal{T}_{L}(s)=\frac{1}{M}\sum_{j=1}^{M}\mathrm{KL}\left(\mathcal{N}\left(\mu_{\theta_k}^{j}(s),\Sigma_{\theta_k}^{j}(s)\right)\,\middle\|\,\mathcal{N}\left(\mu_{\theta}^{j}(s),\Sigma_{\theta}^{j}(s)\right)\right)$$
where $\mathcal{T}_{H}$ evaluates the KL divergence between the categorical distributions and $\mathcal{T}_{L}$ corresponds to the average KL divergence across Gaussian components, as also described in the main paper (cf. Equation 5 in the main paper).
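The two-part distance can be computed in closed form. The Python sketch below does so for one state and task, assuming diagonal covariances given as per-dimension standard deviations; the dictionary layout with keys `weights`, `means`, and `stds` and the function names are hypothetical conveniences for this example.

```python
import math


def kl_categorical(p, q):
    """KL(Cat(p) || Cat(q)) for probability vectors with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_diag_gaussian(mean_p, std_p, mean_q, std_q):
    """KL(N(mean_p, std_p^2) || N(mean_q, std_q^2)), summed over dimensions."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q)
    )


def policy_distance(old, new):
    """T = T_H + T_L between old and new mixture policies."""
    t_high = kl_categorical(old["weights"], new["weights"])
    num = len(old["means"])
    t_low = sum(
        kl_diag_gaussian(old["means"][j], old["stds"][j],
                         new["means"][j], new["stds"][j])
        for j in range(num)
    ) / num
    return t_high + t_low


old_policy = {"weights": [0.5, 0.5], "means": [[0.0], [0.0]], "stds": [[1.0], [1.0]]}
new_policy = {"weights": [0.25, 0.75], "means": [[1.0], [0.0]], "stds": [[1.0], [1.0]]}
distance = policy_distance(old_policy, new_policy)
```

Because the components are compared pairwise by index, the distance avoids the intractable KL divergence between two full mixtures while still bounding how far each part of the policy moves.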
In order to bound the change of categorical distributions, means and covariances of the components independently – which makes it easy to control the convergence of the policy and which can prevent premature convergence as argued in Abdolmaleki et al. [2] – we separate out the following three intermediate policies
$$\begin{aligned} \pi_\theta^{\Sigma}(a|s,i) &= \pi\left(a\,\middle|\,\{\alpha_{\theta_k}^{j},\mu_{\theta_k}^{j},\Sigma_{\theta}^{j}\}(s,i)_{j=1\dots M}\right) \\ \pi_\theta^{\mu}(a|s,i) &= \pi\left(a\,\middle|\,\{\alpha_{\theta_k}^{j},\mu_{\theta}^{j},\Sigma_{\theta_k}^{j}\}(s,i)_{j=1\dots M}\right) \\ \pi_\theta^{\alpha}(a|s,i) &= \pi\left(a\,\middle|\,\{\alpha_{\theta}^{j},\mu_{\theta_k}^{j},\Sigma_{\theta_k}^{j}\}(s,i)_{j=1\dots M}\right) \end{aligned}$$
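which yields the following final optimization program
$$\begin{aligned} \theta^{(k+1)} = \arg\max_{\theta}\ & \mathbb{E}_{s\sim\mathcal{D},\,i\sim I,\,a\sim q(\cdot|s,i)}\left[\log\pi_\theta^{\mu}(a|s,i)+\log\pi_\theta^{\Sigma}(a|s,i)+\log\pi_\theta^{\alpha}(a|s,i)\right], \\ \text{s.t.}\quad & \mathbb{E}_{s\sim\mathcal{D},\,i\sim I}\left[\mathcal{T}\left(\pi_{\theta_k}(\cdot|s,i),\pi_\theta^{\mu}(\cdot|s,i)\right)\right] < \epsilon_{\mu}, \\ & \mathbb{E}_{s\sim\mathcal{D},\,i\sim I}\left[\mathcal{T}\left(\pi_{\theta_k}(\cdot|s,i),\pi_\theta^{\Sigma}(\cdot|s,i)\right)\right] < \epsilon_{\Sigma}, \\ & \mathbb{E}_{s\sim\mathcal{D},\,i\sim I}\left[\mathcal{T}\left(\pi_{\theta_k}(\cdot|s,i),\pi_\theta^{\alpha}(\cdot|s,i)\right)\right] < \epsilon_{\alpha}. \end{aligned} \tag{13}$$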
This decoupling allows us to set different $\epsilon$ values for the change of means, covariances and categorical probabilities, i.e., $\epsilon_{\mu},\epsilon_{\Sigma},\epsilon_{\alpha}$. Different values of $\epsilon$ lead to different learning rates. We always set a much smaller epsilon for the covariance and categorical than for the mean. The intuition is that while we would like the distribution to converge quickly in the action space, we also want to keep the exploration both locally (via the covariance matrix) and globally (via the high-level categorical distribution) to avoid premature convergence.
A-B Algorithmic Details
A-B1 Pseudocode for the full procedure
We provide pseudocode for the full optimization procedure and for the asynchronous data gathering performed by RHPO in the two algorithm listings below (the learner update first, followed by the actor). The implementation relies on Sonnet [37] and TensorFlow [1].
Input: $N_{\text{steps}}$ number of update steps, $N_{\text{targetUpdate}}$ update steps between target updates, $N_{s}$ number of action samples per state, KL regularization parameters $\epsilon$, initial parameters for $\pi$, $\eta$ and $\varphi$
initialize $k=0$
while $k\le N_{\text{steps}}$ do
    for $n$ in $[0\dots N_{\text{targetUpdate}}]$ do
        sample a batch of trajectories $\tau$ from replay buffer $B$
        sample $N_{s}$ actions from $\pi_{\theta_k}$ to estimate expectations below
        // compute mean gradients over batch for policy, Lagrangian multipliers and Q-function
        $\delta_{\pi}\leftarrow\nabla_{\theta}\sum_{s_t\in\tau}\sum_{j=1}^{N_s}\left[\exp\left(\frac{Q(s_t,a_j,i)}{\eta}\right)\log\pi_{\theta}(a_j|s_t,i)\right]$ following Eq. 6
        $\delta_{\eta}\leftarrow\nabla_{\eta}g(\eta)=\nabla_{\eta}\left(\eta\epsilon+\eta\sum_{s_t\in\tau}\log\frac{1}{N_s}\sum_{j=1}^{N_s}\left[\exp\left(\frac{Q(s_t,a_j,i)}{\eta}\right)\right]\right)$ following Eq. 9
        $\delta_{Q}\leftarrow\nabla_{\varphi}\sum_{i\sim I}\sum_{(s_t,a_t)\in\tau}\left(\widehat{Q}_{\varphi}(s_t,a_t,i)-Q^{\text{ret}}\right)^{2}$ with $Q^{\text{ret}}$ following Eq. 7
        // apply gradient updates
        $\pi_{\theta_{k+1}}=$ optimizer_update($\pi$, $\delta_{\pi}$)
        $\eta=$ optimizer_update($\eta$, $\delta_{\eta}$)
        $\widehat{Q}_{\varphi}=$ optimizer_update($\widehat{Q}_{\varphi}$, $\delta_{Q}$)
        $k=k+1$
    end for
    // update target networks
    $\pi^{\prime}=\pi$, $Q^{\prime}=Q$
end while
Input: $N_{\text{trajectories}}$ number of total trajectories requested, $T$ steps per episode, $\xi$ scheduling period
initialize $N = 0$
while $N < N_{\text{trajectories}}$ do
    fetch parameters $\theta$
    // collect new trajectory from environment
    $\tau = \{\}$
    for $t$ in $[0 \dots T]$ do
        if $t \equiv 0 \pmod{\xi}$ then
            // sample active task from uniform distribution
            $i_{act} \sim I$
        end if
        $a_t \sim \pi_{\theta}(\cdot \mid s_t, i_{act})$
        // execute action and determine rewards for all tasks
        $\bar{r} = [r_{i_1}(s_t, a_t), \dots, r_{i_I}(s_t, a_t)]$
        $\tau \leftarrow \tau \cup \{(s_t, a_t, \bar{r}, \pi_{\theta}(a_t \mid s_t, i_{act}))\}$
    end for
    send batch trajectories $\tau$ to replay buffer
    $N = N + 1$
end while
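The data-gathering loop can be sketched in a few lines of Python. The environment, policy and reward functions below are toy stand-ins, and all names (`collect_trajectory`, `toy_policy`, `toy_step`) are hypothetical; only the scheduling logic (a new task drawn uniformly every $\xi$ steps, rewards stored for all tasks) follows the listing above.

```python
import random


def collect_trajectory(policy, step_env, reward_fns, s0, T, xi, rng):
    """Collect one episode of T steps, resampling the active task every xi steps."""
    tau, s, i_act = [], s0, None
    for t in range(T):
        if t % xi == 0:
            i_act = rng.randrange(len(reward_fns))  # uniform choice of the active task
        a, prob = policy(s, i_act, rng)             # action and its probability
        rewards = [r(s, a) for r in reward_fns]     # rewards for all tasks, not only the active one
        tau.append((s, a, rewards, prob))
        s = step_env(s, a)
    return tau


# Toy stand-ins (assumptions for illustration only).
def toy_policy(s, i_act, rng):
    return rng.choice([-1, 1]), 0.5


def toy_step(s, a):
    return s + a


reward_fns = [lambda s, a: float(s > 0), lambda s, a: float(s < 0)]
tau = collect_trajectory(toy_policy, toy_step, reward_fns, 0, 10, 5, random.Random(0))
```

Storing the full reward vector is what allows every transition to be reused for all tasks during the off-policy update.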
AB2 Details on the policy improvement step
As described in the main paper, we consider the same setting as scheduled auxiliary control (SACX) [38] to perform policy improvement, with uniform random switches between tasks every $\xi$ steps within an episode (the SACU setting).
Given a replay buffer containing data gathered from all tasks, where for each trajectory snippet $\tau =\{({s}_{0},{a}_{0},{R}_{0}),\mathrm{\dots},({s}_{L},{a}_{L},{R}_{L})\}$ we record the rewards for all tasks ${R}_{t}=[{r}_{{i}_{1}}({s}_{t},{a}_{t}),\mathrm{\dots},{r}_{{i}_{I}}({s}_{t},{a}_{t})]$ as a vector in the buffer, we define the retrace objective for learning $\widehat{Q}$, parameterized via $\varphi $, following Riedmiller et al. [38] as
$$\min_{\varphi} L(\varphi) = \sum_{i \sim I} \mathbb{E}_{\tau \sim \mathcal{D}} \Big[ \big( r_i(s_t, a_t) + \gamma Q^{\text{ret}}(s_{t+1}, a_{t+1}, i) - \widehat{Q}_{\varphi}(s_t, a_t, i) \big)^2 \Big],$$  (14)
$$\text{with } Q^{\text{ret}}(s_t, a_t, i) = \sum_{j=t}^{\infty} \Big( \gamma^{j-t} \prod_{k=t}^{j} c_k \Big) \big[ \delta_Q(s_j, s_{j+1}) \big],$$
$$\delta_Q(s_j, s_{j+1}) = r_i(s_j, a_j) + \gamma \mathbb{E}_{\pi_{\theta_k}(a \mid s_{j+1}, i)} \big[ \widehat{Q}_{\varphi'}(s_{j+1}, \cdot, i; \varphi') \big] - \widehat{Q}_{\varphi'}(s_j, a_j, i),$$
where the importance weights are defined as $c_k = \min\left(1, \pi_{\theta_k}(a_k \mid s_k, i) / b(a_k \mid s_k)\right)$, with $b(a_k \mid s_k)$ denoting an arbitrary behavior policy; in particular, this will be the policy for the executed tasks during an episode, as in [38]. Note that, in practice, we truncate the infinite sum after $L$ steps, bootstrapping with $\widehat{Q}$. We further perform optimization of Equation (7) via gradient descent and make use of a target network [29], denoted with parameters $\varphi'$, which we copy from $\varphi$ after a fixed number of gradient steps (the target network update period listed in subsection AB4). We reiterate that, as the state-action value function $\widehat{Q}$ remains independent of the policy’s structure, we are able to utilize any other off-the-shelf Q-learning algorithm such as TD(0) [47].
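As a minimal sketch, the truncated sum of importance-weighted TD errors can be transcribed directly from the definition of $Q^{\text{ret}}$ above, for a single task and a single trajectory snippet starting at $t = 0$. The function name and the flat list inputs are assumptions made for illustration; in particular, the target-network values are passed in precomputed rather than evaluated by a network.

```python
def truncated_retrace_sum(rewards, q_taken, v_next, pi_probs, b_probs, gamma):
    """Sum over j of gamma^j * (prod_{k<=j} c_k) * delta_j, truncated at len(rewards).

    rewards[j]  : r_i(s_j, a_j) for the task under consideration
    q_taken[j]  : target-network value Q'(s_j, a_j, i)
    v_next[j]   : expectation of Q'(s_{j+1}, ., i) under the current policy
    pi_probs[j] : current policy probability of the stored action a_j
    b_probs[j]  : behavior policy probability of the stored action a_j
    """
    total, weight = 0.0, 1.0
    for j in range(len(rewards)):
        c_j = min(1.0, pi_probs[j] / b_probs[j])              # truncated importance weight
        weight *= c_j                                          # running product of c_k up to j
        delta = rewards[j] + gamma * v_next[j] - q_taken[j]    # TD error delta_Q
        total += (gamma ** j) * weight * delta
    return total
```

When the stored actions are at least as likely under the current policy as under the behavior policy, all weights equal one and the sum reduces to a discounted sum of TD errors.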
Given that we utilize the same policy evaluation mechanism as SACU, it is worth pausing here to identify the differences between SACU and our approach. The main difference is in the policy parameterization: SACU used a monolithic policy $\pi(a \mid s, i)$ for each task (although a neural network with shared components, potentially leading to some implicit task transfer, was used). Furthermore, we perform policy optimization based on MPO instead of using stochastic value gradients (SVG [21]). We can thus recover a variant of plain SACU using MPO if we drop the hierarchical policy parameterization, which we employ in the single task experiments in the main paper.
AB3 Network Architectures
To represent the Q-function in the multitask case we use the network architecture from SACX (see right subfigure in Figure 10). The proprioception of the robot, the features of the objects and the actions are fed together into a torso network. At the input we use a fully connected first layer of 200 units, followed by a layer normalization operator, an optional tanh activation and another fully connected layer of 200 units with an ELU activation function. The output of this torso network is shared by independent head networks for each of the tasks (or intentions, as they are called in the SACX paper). Each head has two fully connected layers and outputs a Q-value for this task, given the input of the network. Using the task identifier we can then compute the Q-value for a given sample by discrete selection of the corresponding head output.
While we use this network architecture for the Q-function for all multitask experiments, we investigate different architectures for the policy in this paper. The original SACX policy architecture is shown in Figure 10 (left subfigure). The main structure follows the same basic principle that we use in the Q-function architecture. The only difference is that the heads compute the required parameters for the policy distribution we want to use (see subsection AB4). This architecture is referred to as the independent heads (or task-dependent heads) architecture.
The alternatives we investigate in this paper are the monolithic policy architecture (see Figure 11, left subfigure) and the hierarchical policy architecture (see Figure 11, right subfigure). For the monolithic policy architecture we reduce the original policy architecture to a single head and append the task ID as a one-hot encoded vector to the input. For the hierarchical architecture, we build on the same torso and create a set of networks parameterizing the Gaussians, which are shared across tasks, and a task-specific network to parameterize the categorical distribution for each task. The final mixture distribution is task-dependent for the high-level controller but task-independent for the low-level policies.
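The structure of the hierarchical policy can be sketched as a mixture whose weights depend on the task while the components do not. The snippet below is a simplified illustration with scalar actions and state-independent components; in the architecture described above both the logits and the Gaussian parameters are network outputs conditioned on the state. All function and variable names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    z = sum(e)
    return [x / z for x in e]


def sample_action(task_logits, components, task_id, rng):
    """Draw a component from the task-specific categorical, then an action from it.

    task_logits[i] : logits of the high-level categorical for task i
    components     : list of (mean, std) pairs shared by all tasks
    """
    probs = softmax(task_logits[task_id])
    k = rng.choices(range(len(components)), weights=probs)[0]
    mean, std = components[k]
    return k, rng.gauss(mean, std)


def mixture_density(a, task_logits, components, task_id):
    """Density of the task-conditional mixture at a scalar action a."""
    probs = softmax(task_logits[task_id])
    return sum(
        p * math.exp(-0.5 * ((a - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, (m, s) in zip(probs, components)
    )
```

Because the components are shared, data from any task can update every low-level Gaussian, while only the categorical weights are specific to a task.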
AB4 Algorithm Hyperparameters
In this section we outline the details on the hyperparameters used for RHPO and baselines in both single-task and multitask experiments. All experiments use feedforward neural networks. We consider a flat policy represented by a Gaussian distribution and a hierarchical policy represented by a mixture of Gaussians distribution. The flat policy is given by a Gaussian distribution with a diagonal covariance matrix, i.e., $\pi(a \mid s, \theta) = \mathcal{N}(\mu, \Sigma)$. The neural network outputs the mean $\mu = \mu(s)$ and diagonal Cholesky factors $A = A(s)$, such that $\Sigma = A A^T$. The diagonal factor $A$ has positive diagonal elements enforced by the softplus transform $A_{ii} \leftarrow \log(1 + \exp(A_{ii}))$ to ensure positive definiteness of the diagonal covariance matrix. The mixture of Gaussians policy has a number of Gaussian components as well as a categorical distribution for selecting the components. The neural network outputs the Gaussian components based on the same setup described above for a single Gaussian and outputs the logits for representing the categorical distribution. Table I shows the hyperparameters we used for the single-task experiments. We found that layer normalization and a hyperbolic tangent ($\tanh$) applied to the output of the layer normalization are important for the stability of the algorithms. For RHPO the most important hyperparameters are the constraints in Step 1 and Step 2 of the algorithm.
Hyperparameters  Hierarchical  Single Gaussian 
Policy net  200-200-200  200-200-200 
Number of actions sampled per state  10  10 
Q function net  500-500-500  500-500-500 
Number of components  3  NA 
$\epsilon$  0.1  0.1 
$\epsilon_{\mu}$  0.0005  0.0005 
$\epsilon_{\Sigma}$  0.00001  0.00001 
$\epsilon_{\alpha}$  0.0001  NA 
Discount factor ($\gamma $)  0.99  0.99 
Adam learning rate  0.0002  0.0002 
Replay buffer size  2000000  2000000 
Target network update period  250  250 
Batch size  256  256 
Activation function  elu  elu 
Layer norm on first layer  Yes  Yes 
Tanh on output of layer norm  Yes  Yes 
Tanh on input actions to Q-function  Yes  Yes 
Retrace sequence length  10  10 
Hyperparameters  Hierarchical  Independent  Monolith
Policy torso (shared across tasks)  400-200
…  …  100  NA
…  …  NA  200
Number of action samples  20
Q function torso (shared across tasks)  400-400
Q function head (per task)  300
Number of components  …  NA  NA
Discount factor ($\gamma$)  0.99
Replay buffer size  1e6 * number of tasks
Target network update period  500
Batch size  256 (512 for Pile1)
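The softplus parameterization of the diagonal covariance described at the start of this subsection can be sketched as follows. The function names are hypothetical, and the raw values stand in for the corresponding network outputs; for a diagonal factor $A$, the product $A A^T$ is simply the vector of squared diagonal entries.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def diagonal_covariance(raw_diag):
    """Map unconstrained outputs to the diagonal of Sigma = A A^T."""
    a = [softplus(x) for x in raw_diag]  # strictly positive Cholesky diagonal
    return [v * v for v in a]            # diagonal entries of A A^T
```

Since the softplus output is strictly positive for every finite input, each variance is positive and the covariance matrix is positive definite by construction.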
AC Additional Details on the SACU with SVG baseline
For the SACU baseline we used a reimplementation of the method from [38] using SVG [20] for optimizing the policy. Concretely we use the same basic network structure as for the "Monolithic" baseline with MPO and parameterize the policy as
$${\pi}_{\theta}=\mathcal{N}({\mu}_{\theta}(s,i),{\sigma}_{\theta}^{2}(s,i)I),$$ 
where $I$ denotes the identity matrix and ${\sigma}_{\theta}(s,i)$ is computed from the network output via a softplus activation function.
Together with entropy regularization, as described in [38], the policy can be optimized via gradient ascent, following the reparameterized gradient for states sampled from the replay buffer:
$\nabla_{\theta} \left( \mathbb{E}_{\pi_{\theta}(a \mid s, i)} [\widehat{Q}(a, s, i)] + \alpha \mathrm{H}\left(\pi_{\theta}(a \mid s, i)\right) \right),$  (15) 
which can be computed, using the reparameterization trick, as
$\mathbb{E}_{\zeta \sim \mathcal{N}(0, I)} \left[ \nabla_{\theta} g_{\theta}(s, i, \zeta) \nabla_{g} \widehat{Q}(g_{\theta}(s, i, \zeta), s, i) \right] + \alpha \nabla_{\theta} \mathrm{H}\left(\pi_{\theta}(a \mid s, i)\right),$  (16) 
where $g_{\theta}(s, i, \zeta) = \mu_{\theta}(s, i) + \sigma_{\theta}(s, i) \cdot \zeta$ is now a deterministic function of a sample from the standard multivariate normal distribution. See e.g. [20] (for SVG) as well as [26] (for the reparameterization trick) for a detailed explanation.
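To make the reparameterized gradient concrete, the sketch below treats the mean and standard deviation themselves as the parameters and computes the first term of Equation (16) for a scalar action and a single noise sample. The function names and the analytic critic derivative passed in as `dq_da` are assumptions for illustration; the entropy term is omitted.

```python
import random


def reparameterized_action(mu, sigma, rng):
    """a = mu + sigma * zeta with zeta ~ N(0, I); returns the action and the noise."""
    zeta = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sigma * z for m, z in zip(mu, zeta)], zeta


def pathwise_gradient(mu, sigma, zeta, dq_da):
    """Gradient of Q(g(mu, sigma, zeta)) with respect to (mu, sigma), scalar case.

    dq_da is the derivative of the critic with respect to the action.
    """
    a = mu + sigma * zeta
    grad_a = dq_da(a)
    # Chain rule: dg/dmu = 1 and dg/dsigma = zeta.
    return grad_a * 1.0, grad_a * zeta
```

For example, with the toy critic $Q(a) = -(a - 1)^2$, `pathwise_gradient(0.0, 1.0, 0.5, lambda a: -2.0 * (a - 1.0))` pushes the mean towards the maximizer at $a = 1$.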
AD Details on the Experimental Setup
AD1 Simulation (Single and Multitask)
For the simulated robot arm experiments we used the numerical simulator MuJoCo (see www.mujoco.org), with a model identified from the real robot setup.
We run the simulation experiments for 2 to 7 days (depending on the task) with access to 25 recent CPUs with 32 cores each (depending on the number of actors) and 2 recent NVIDIA GPUs for the learner. Computation for data buffering is negligible.
AD2 Real Robot Multitask
Compared to simulation where the ground truth position of all objects is known, in the real robot setting, three cameras on the basket track the cube using fiducials (augmented reality tags).
For safety reasons, external forces are measured at the wrist and the episode is terminated if a threshold of 20 N on any of the three principal axes is exceeded (this is handled as a terminal state with reward 0 for the agent), adding further to the difficulty of the task.
The real robot setup differs from the simulation in the reset behaviour between episodes, since objects need to be physically moved around when randomizing, which takes a considerable amount of time. To keep overhead small, object positions are randomized only every 25 episodes, using a hand-coded controller. Objects are also placed back in the basket if they were thrown out during the previous episode. Other than that, objects start in the same place as they were left in the previous episode. The robot’s starting pose is randomized each episode, as in simulation.
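The two rules above (force-based termination and periodic object randomization) can be written as simple predicates. This is only a sketch: the function names are hypothetical, comparing the absolute value of each force component is an assumption about how the threshold is applied, and episodes are assumed to be counted from zero.

```python
def exceeds_force_limit(wrist_force, threshold=20.0):
    """True if the force on any of the three axes exceeds the threshold (in N)."""
    return any(abs(f) > threshold for f in wrist_force)


def randomize_objects(episode_index, period=25):
    """True on episodes where object positions are re-randomized."""
    return episode_index % period == 0
```

An episode for which `exceeds_force_limit` fires would be ended with reward 0, as described above.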
AE Task Descriptions
AE1 Pile1
For this task we have a real setup and a MuJoCo simulation that are well aligned. It consists of a Sawyer robot arm mounted on a table and equipped with a Robotiq 2F-85 parallel gripper. In front of the robot there is a basket of size 20x20 cm which contains three cubes with an edge length of 5 cm (see Figure 12).
The agent is provided with proprioception information for the arm (joint positions, velocities and torques), and the tool center point position computed via forward kinematics. For the gripper, it receives the motor position and velocity, as well as a binary grasp flag. It also receives a wrist sensor’s force and torque readings. Finally, it is provided with the cubes’ poses as estimated via the fiducials, and the relative distances between the arm’s tool center point and each object. At each time step, a history of two previous observations is provided to the agent, along with the last two joint control commands, in order to account for potential communication delays on the real robot. The observation space is detailed in Table IV.
The robot arm is controlled in Cartesian mode at 20 Hz. The action space for the agent is 5-dimensional, as detailed in Table III. The gripper movement is also restricted to a cubic volume above the basket using virtual walls.
Entry  Dims  Unit  Range 
Translational Velocity in x, y, z  3  m/s  [-0.07, 0.07] 
Wrist Rotation Velocity  1  rad/s  [-1, 1] 
Finger speed  1  tics/s  [-255, 255] 
Entry  Dims  Unit 

Joint Position (Arm)  7  rad 
Joint Velocity (Arm)  7  rad/s 
Joint Torque (Arm)  7  Nm 
Joint Position (Hand)  1  rad 
Joint Velocity (Hand)  1  tics/s 
Force-Torque (Wrist)  6  N, Nm 
Binary Grasp Sensor  1  au 
TCP Pose  7  m, au 
Last Control Command (Joint Velocity)  8  rad/s 
Green Cube Pose  7  m, au 
Green Cube Relative Pose  7  m, au 
Yellow Cube Pose  7  m, au 
Yellow Cube Relative Pose  7  m, au 
Blue Cube Pose  7  m, au 
Blue Cube Relative Pose  7  m, au 
For the Pile1 experiment we use 7 different tasks to learn, following the SACX principles. The first 6 tasks are seen as auxiliary tasks that help to learn the final task (STACK_AND_LEAVE(G, Y)) of stacking the green cube on top of the yellow cube. Overview of the used tasks:

• REACH(G): $stol(d(TCP, G), 0.02, 0.15)$
Minimize the distance of the TCP to the green cube.
• GRASP:
Activate the grasp sensor of the gripper ("inward grasp signal" of the Robotiq gripper).
• LIFT(G): $slin(G, 0.03, 0.10)$
Increase the z-coordinate of an object by more than 3 cm relative to the table.
• PLACE_WIDE(G, Y): $stol(d(G, Y + [0, 0, 0.05]), 0.01, 0.20)$
Bring the green cube to a position 5 cm above the yellow cube.
• PLACE_NARROW(G, Y): $stol(d(G, Y + [0, 0, 0.05]), 0.00, 0.01)$
Like PLACE_WIDE(G, Y) but more precise.
• STACK(G, Y): $btol(d_{xy}(G, Y), 0.03) * btol(d_{z}(G, Y) + 0.05, 0.01) * (1 - GRASP)$
Sparse binary reward for bringing the green cube on top of the yellow one (with 3 cm tolerance horizontally and 1 cm vertically) and disengaging the grasp sensor.
• STACK_AND_LEAVE(G, Y): $stol(d_{z}(TCP, G) + 0.10, 0.03, 0.10) * STACK(G, Y)$
Like STACK(G, Y), but the arm additionally needs to move 10 cm above the green cube.
Let $d(o_i, o_j)$ be the distance between the reference points of two objects (the reference point of a cube is its center of mass; the TCP is the reference point of the gripper), and let $d_A$ be the distance only in the dimensions denoted by the set of axes $A$. We can define the reward function details by:
$$  (17) 
$$  (18) 
$$  (19) 
AE2 Pile2
For the Pile2 task, taken from [38], we use a different robot arm, control mode and task setup to emphasize that RHPO’s improvements are not restricted to Cartesian control or a specific robot and that the approach also works for multiple external tasks.
Here, the agent controls a simulated Kinova Jaco robot arm, equipped with a Kinova KG-3 gripper. The robot faces a 40 x 40 cm basket that contains a red cube and a blue cube. Both cubes have an edge length of 5 cm (see Figure 13). The agent is provided with proprioceptive information for the arm and the fingers (joint positions and velocities) as well as the tool center point position (TCP) computed via forward kinematics. Further, the simulated gripper is equipped with a touch sensor for each of the three fingers, whose value is provided to the agent as well. Finally, the agent receives the cubes’ poses, their translational and rotational velocities and the relative distances between the arm’s tool center point and each object. Neither observation nor action history is used in the Pile2 experiments. The cubes are spawned at random on the table surface and the robot hand is initialized randomly above the tabletop with a height offset of up to 20 cm above the table (minimum 10 cm). The observation space is detailed in Table VI.
Entry  Dims  Unit  Range 
Joint Velocity (Arm)  6  rad/s  [-0.8, 0.8] 
Joint Velocity (Hand)  3  rad/s  [-0.8, 0.8] 
The robot arm is controlled in raw joint velocity mode at 20 Hz. The action space is 9-dimensional as detailed in Table V. There are no virtual walls and the robot’s movement is solely restricted by the velocity limits and the objects in the scene.
Analogous to Pile1 and the SACX setup, we use 10 different tasks for Pile2. The first 8 tasks are seen as auxiliary tasks that the agent uses to learn the two main tasks PILE(R) and PILE(B), which represent stacking the red cube on the blue cube and stacking the blue cube on the red cube, respectively. The tasks used in the experiment are:

• REACH(R) = $stol(d(TCP, R), 0.01, 0.25)$
Minimize the distance of the TCP to the red cube.
• REACH(B) = $stol(d(TCP, B), 0.01, 0.25)$
Minimize the distance of the TCP to the blue cube.
• MOVE(R) = $slin(linvel(R), 0, 1)$
Move the red cube.
• MOVE(B) = $slin(linvel(B), 0, 1)$
Move the blue cube.
• LIFT(R) = $btol(pos_{z}(R), 0.05)$
Increase the z-coordinate of the red cube to more than 5 cm relative to the table.
• LIFT(B) = $btol(pos_{z}(B), 0.05)$
Increase the z-coordinate of the blue cube to more than 5 cm relative to the table.
• ABOVE_CLOSE(R, B) = $above(R, B) * stol(d(R, B), 0.05, 0.2)$
Bring the red cube to a position above and close to the blue cube.
• ABOVE_CLOSE(B, R) = $above(B, R) * stol(d(R, B), 0.05, 0.2)$
Bring the blue cube to a position above and close to the red cube.
• PILE(R):
Place the red cube on another object (touches the top). Only given when the cube doesn’t touch the robot or the table.
• PILE(B):
Place the blue cube on another object (touches the top). Only given when the cube doesn’t touch the robot or the table.
The sparse reward above(A, B) is given by comparing the bounding boxes of the two objects A and B. If the bounding box of object A is completely above the highest point of object B’s bounding box, above(A, B) is $1$, otherwise above(A, B) is $0$.
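The bounding-box test behind above(A, B) can be sketched as follows. The representation of a bounding box as a pair of (x, y, z) corner tuples, the function name, and the use of a strict inequality are assumptions made for illustration.

```python
def above(box_a, box_b):
    """Return 1.0 if box A lies completely above the highest point of box B, else 0.0.

    Each box is a pair (min_corner, max_corner) of (x, y, z) tuples,
    given in a common frame with z pointing up.
    """
    lowest_z_of_a = box_a[0][2]
    highest_z_of_b = box_b[1][2]
    return 1.0 if lowest_z_of_a > highest_z_of_b else 0.0
```

Note that this test alone ignores the horizontal offset between the objects, which is why ABOVE_CLOSE multiplies it with a distance-based term.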
Entry  Dims  Unit 

Joint Position (Arm)  6  rad 
Joint Velocity (Arm)  6  rad/s 
Joint Position (Hand)  3  rad 
Joint Velocity (Hand)  3  rad/s 
TCP Position  3  m 
Touch Force (Fingers)  3  N 
Red Cube Pose  7  m, au 
Red Cube Velocity  6  m/s, dq/dt 
Red Cube Relative Position  3  m 
Blue Cube Pose  7  m, au 
Blue Cube Velocity  6  m/s, dq/dt 
Blue Cube Relative Position  3  m 
Lid Position  1  rad 
Lid Velocity  1  rad/s 
AE3 CleanUp
The CleanUp task is also taken from [38] and builds on the setup described for the Pile2 task. Besides the two cubes, the workspace contains an additional box with a moveable lid that is always closed initially (see Figure 14). The agent’s goal is to clean up the scene by placing the cubes inside the box. In addition to the observations used in the Pile2 task, the agent observes the lid’s angle and its angular velocity.
Analogous to Pile2 and the SACX setup, we use 13 different tasks for CleanUp. The first 12 tasks are seen as auxiliary tasks that the agent uses to learn the main task INSIDE(ALL, BOX). The tasks used in this experiment are:

• REACH(R) = $stol(d(TCP, R), 0.01, 0.25)$
Minimize the distance of the TCP to the red cube.
• REACH(B) = $stol(d(TCP, B), 0.01, 0.25)$
Minimize the distance of the TCP to the blue cube.
• MOVE(R) = $slin(linvel(R), 0, 1)$
Move the red cube.
• MOVE(B) = $slin(linvel(B), 0, 1)$
Move the blue cube.
• NO_TOUCH = $1 - GRASP$
Sparse binary reward, given when none of the touch sensors is active.
• LIFT(R) = $btol(pos_{z}(R), 0.05)$
Increase the z-coordinate of the red cube to more than 5 cm relative to the table.
• LIFT(B) = $btol(pos_{z}(B), 0.05)$
Increase the z-coordinate of the blue cube to more than 5 cm relative to the table.
• OPEN_BOX = $slin(angle(lid), 0.01, 1.5)$
Open the lid up to 85 degrees.
• ABOVE_CLOSE(R, BOX) = $above(R, BOX) * btol(d(R, BOX), 0.2)$
Bring the red cube to a position above and close to the box.
• ABOVE_CLOSE(B, BOX) = $above(B, BOX) * btol(d(B, BOX), 0.2)$
Bring the blue cube to a position above and close to the box.
• INSIDE(R, BOX) = $inside(R, BOX)$
Place the red cube inside the box.
• INSIDE(B, BOX) = $inside(B, BOX)$
Place the blue cube inside the box.
• INSIDE(ALL, BOX) = $INSIDE(R, BOX) * INSIDE(B, BOX)$
Place all cubes inside the box.
The sparse reward inside(A, BOX) is given by comparing the bounds of the object A and the box. If the bounding box of object A is completely within the box’s bounds inside(A, BOX) is $1$, otherwise inside(A, BOX) is $0$.
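A minimal sketch of the containment test and of the product used for the main task is given below. As before, the corner-tuple representation of the bounds and the function names are hypothetical, and treating touching faces as still inside is an assumption.

```python
def inside(box_a, box_bounds):
    """Return 1.0 if box A lies completely within the bounds of the box, else 0.0.

    Both arguments are pairs (min_corner, max_corner) of (x, y, z) tuples.
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_bounds
    contained = all(
        b_min[d] <= a_min[d] and a_max[d] <= b_max[d] for d in range(3)
    )
    return 1.0 if contained else 0.0


def inside_all(cubes, box_bounds):
    """Product of the per-cube rewards: 1.0 only if every cube is inside the box."""
    reward = 1.0
    for cube in cubes:
        reward *= inside(cube, box_bounds)
    return reward
```

The product makes the main task reward strictly sparser than each single-cube reward, since it is non-zero only when all cubes satisfy the test at the same time.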
AF Multitask Results
AG Additional Multitask Ablations
AG1 Importance of Regularization
Coordinating convergence progress in hierarchical models can be challenging but can be effectively moderated by the KL constraints. We perform an ablation study varying the strength of the KL constraints on the high-level controller between the prior and the current policy during training, demonstrating a range of possible degenerate behaviors.
As depicted in Figure 19, with a weak KL constraint, the high-level controller can converge too quickly, leading to only a single sub-policy getting a gradient signal per step. In addition, the categorical distribution tends to change at a high rate, preventing successful convergence for the low-level policies. On the other hand, the low-level policies are missing task information to encourage decomposition, as described in Section IIIB. This fact, in combination with strong KL constraints, can prevent specialization of the low-level policies as the categorical remains nearly static, finally leading to no or very slow convergence. As long as a reasonable constraint is picked (here a range of over 2 orders of magnitude), convergence is fast and the final policies obtain high quality for all tasks. We note that no tuning of the constraints is required across domains and the range of admissible constraints is quite broad.
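To give a sense of scale for the constraint on the high-level controller, the following sketch computes the KL divergence between two categorical distributions over components and checks it against a bound. The function names, the direction of the KL (previous to updated distribution) and the example distributions are assumptions for illustration; the bound 0.0001 is the value of $\epsilon_{\alpha}$ listed for the single-task experiments.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) for two categorical distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def within_constraint(p_old, p_new, eps_alpha=0.0001):
    """True if the change of the categorical stays within the KL bound."""
    return categorical_kl(p_old, p_new) <= eps_alpha
```

For instance, moving from a uniform distribution over two components to (0.9, 0.1) in a single step gives a KL of roughly 0.51, far outside a bound of this size, whereas a shift to (0.505, 0.495) stays within it.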
AG2 Impact of Data Rate
Evaluating in a distributed off-policy setting enables us to investigate the effect of different rates for data generation by controlling the number of actors. Figure 20 demonstrates how the different agents converge more slowly at lower data rates (changing from 5 to 1 actor). These experiments are highly relevant for the application domain as the number of available physical robots for real-world experiments is typically highly limited. To limit computational cost, we focus on the simplest domain from Section IVA, Pile1, in this comparison.
[Figure panels, one per Pile1 task: Reach, Grasp, Lift, Place Wide, Place Narrow, Stack, Stack and Leave.]
AG3 Number of Component Policies
AH Hierarchical Policies in Reparameterization Gradientbased RL
To test whether the benefits of a hierarchical policy transfer to a setting where a different algorithm is used to optimize the policy, we performed additional experiments using SVG [20] in place of MPO. For this purpose we use the same hierarchical policy structure as for the MPO experiments but change the categorical to an implementation that enables reparameterization with the Gumbel-Softmax trick [27, 24]. We then change the entropy regularization from Equation (16) to a KL towards a target policy (as entropy regularization did not give stable learning in this setting) and use a regularizer equivalent to the distance function (the per-component KLs from Equation (13)); a regularization multiplier of 0.05 was found to be the best setting via a coarse grid search. This is similar to previous work on hierarchical RL with SVG [52].
This extension of SVG is conceptually similar to a single-step-option version of the option-critic [6]. Simplified, SVG is an off-policy actor-critic algorithm which builds on the reparameterization trick instead of the likelihood ratio trick (commonly leading to lower variance [30]). Since we do not build on temporally extended sub-policies, the algorithm simplifies to using a single critic (see Section IIIB).
The results of this experiment are depicted in Figure 22. As can be seen, for this simple domain the hierarchical policy yields mild improvements over standard SACU.
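The Gumbel-Softmax relaxation mentioned above replaces a discrete component choice by a differentiable, approximately one-hot vector. The sketch below shows the sampling step only; the function name, the temperature argument and the clamping of the uniform samples are illustrative assumptions, and no claim is made about the exact variant used in the experiments.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample over the components.

    Adds independent Gumbel(0, 1) noise to the logits and applies a
    temperature-scaled softmax; lower temperatures give sharper samples.
    """
    noisy = []
    for l in logits:
        u = max(rng.random(), 1e-12)           # avoid log(0)
        g = -math.log(-math.log(u))            # Gumbel(0, 1) noise
        noisy.append((l + g) / temperature)
    m = max(noisy)
    e = [math.exp(x - m) for x in noisy]       # stable softmax
    z = sum(e)
    return [x / z for x in e]
```

Because the sample is a deterministic function of the logits and the noise, gradients can flow from the critic through the mixture weights in the same way as through the Gaussian components.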