Abstract
Hierarchical reinforcement learning is a promising approach to tacklelonghorizon decisionmaking problems with sparse rewards. Unfortunately, mostmethods still decouple the lowerlevel skill acquisition process and thetraining of a higher level that controls the skills in a new task. Leaving theskills fixed can lead to significant suboptimality in the transfer setting. Inthis work, we propose a novel algorithm to discover a set of skills, andcontinuously adapt them along with the higher level even when training on a newtask. Our main contributions are twofold. First, we derive a new hierarchicalpolicy gradient with an unbiased latentdependent baseline, and we introduceHierarchical Proximal Policy Optimization (HiPPO), an onpolicy method toefficiently train all levels of the hierarchy jointly. Second, we propose amethod of training timeabstractions that improves the robustness of theobtained skills to environment changes. Code and results are available atsites.google.com/view/hipporl
Quick Read (beta)
SubPolicy Adaptation for Hierarchical
Reinforcement Learning
Abstract
Hierarchical reinforcement learning is a promising approach to tackle longhorizon decisionmaking problems with sparse rewards. Unfortunately, most methods still decouple the lowerlevel skill acquisition process and the training of a higher level that controls the skills in a new task. Leaving the skills fixed can lead to significant suboptimality in the transfer setting. In this work, we propose a novel algorithm to discover a set of skills, and continuously adapt them along with the higher level even when training on a new task. Our main contributions are twofold. First, we derive a new hierarchical policy gradient with an unbiased latentdependent baseline, and we introduce Hierarchical Proximal Policy Optimization (HiPPO), an onpolicy method to efficiently train all levels of the hierarchy jointly. Second, we propose a method of training timeabstractions that improves the robustness of the obtained skills to environment changes. Code and videos are available here. ^{†}^{†} ${}^{*}$ Equal Contribution
SubPolicy Adaptation for Hierarchical
Reinforcement Learning
Alexander C. Li${}^{\mathrm{*}}$, Carlos Florensa${}^{\mathrm{*}}$, Ignasi Clavera, Pieter Abbeel 
Department of Computer Science 
University of California, Berkeley 
{alexli1, florensa, iclavera, pabbeel}@berkeley.edu 
1 Introduction
Reinforcement learning (RL) has made great progress in a variety of domains, from playing games such as Pong and Go (mnih2015dqn_human; silver2017alphago) to automating robotic locomotion (schulman2015trpo; heess2017emergence), dexterous manipulation (florensa2017reverse; andrychowicz2018inhand), and perception (nair2018rig; florensa2018selfsupervised). Yet, most work in RL is still learning from scratch when faced with a new problem. This is particularly inefficient when tackling multiple related tasks that are hard to solve due to sparse rewards or long horizons.
A promising technique to overcome this limitation is hierarchical reinforcement learning (HRL) (sutton1999temporal). In this paradigm, policies have several modules of abstraction, allowing to reuse subsets of the modules. The most common case consists of temporal hierarchies (Precup2000TemporalLearning; dayan1993feudal), where a higherlevel policy (manager) takes actions at a lower frequency, and its actions condition the behavior of some lower level skills or subpolicies. When transferring knowledge to a new task, most prior works fix the skills and train a new manager on top. Despite having a clear benefit in kickstarting the learning in the new task, having fixed skills can considerably cap the final performance on the new task (florensa2017snn). Little work has been done on adapting pretrained subpolicies to be optimal for a new task.
In this paper, we develop a new framework for simultaneously adapting all levels of temporal hierarchies. First, we derive an efficient approximated hierarchical policy gradient. The key insight is that, despite the decisions of the manager being unobserved latent variables from the point of view of the Markovian environment, from the perspective of the subpolicies they can be considered as part of the observation. We show that this provides a decoupling of the manager and subpolicy gradients, which greatly simplifies the computation in a principled way. It also theoretically justifies a technique used in other prior works (frans2018mlsh). Second, we introduce a subpolicy specific baseline for our hierarchical policy gradient. We prove that this baseline is unbiased, and our experiments reveal faster convergence, suggesting efficient gradient variance reduction. Then we introduce a more stable way of using this gradient, Hierarchical Proximal Policy Optimization (HiPPO). This method helps us take more conservative steps in our policy space (schulman2017ppo), critical in hierarchies because of the interdependence of each layer. Results show that HiPPO is highly efficient when learning from scratch, i.e. adapting randomly initialized skills, and when adapting pretrained skills on a new task. Finally, we evaluate the benefit of randomizing the timecommitment of the subpolicies, and show it helps both in terms of final performance and zeroshot adaptation on similar tasks.
2 Preliminaries
We define a discretetime finitehorizon discounted Markov decision process (MDP) by a tuple $M=(\mathcal{S},\mathcal{A},\mathcal{P},r,{\rho}_{0},\gamma ,H)$, where $\mathcal{S}$ is a state set, $\mathcal{A}$ is an action set, $\mathcal{P}:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to {\mathbb{R}}_{+}$ is the transition probability distribution, $\gamma \in [0,1]$ is a discount factor, and $H$ the horizon. Our objective is to find a stochastic policy ${\pi}_{\theta}$ that maximizes the expected discounted return within the MDP, $\eta ({\pi}_{\theta})={\mathbb{E}}_{\tau}[{\sum}_{t=0}^{H}{\gamma}^{t}r({s}_{t},{a}_{t})]$. We use $\tau =({s}_{0},{a}_{0},\mathrm{\dots},)$ to denote the entire stateaction trajectory, where ${s}_{0}\sim {\rho}_{0}({s}_{0})$, ${a}_{t}\sim {\pi}_{\theta}({a}_{t}{s}_{t}),$ and ${s}_{t+1}\sim \mathcal{P}({s}_{t+1}{s}_{t},{a}_{t})$.
In this work, we propose a method to learn a hierarchical policy and efficiently adapt all the levels in the hierarchy to perform a new task. We study a quite extended type of hierarchical policies composed of a higher level, or manager ${\pi}_{{\theta}_{h}}({z}_{t}{s}_{t})$, and a lower level, or subpolicy ${\pi}_{{\theta}_{l}}({a}_{{t}^{\prime}}{z}_{t},{s}_{{t}^{\prime}})$. The higher level does not take actions in the environment directly, but rather outputs a command, or latent variable ${z}_{t}\in \mathcal{Z}$, that conditions the behavior of the lower level. We focus on the common case where $\mathcal{Z}={\mathbb{Z}}_{n}$ making the manager choose among $n$ subpolicies, or skills, to execute. The manager typically operates at a lower frequency than the subpolicies, only observing the environment every $p$ timesteps. When the manager receives a new observation, it decides which low level policy to commit to for $p$ environment steps by the means of a latent code $z$. Figure 1 depicts this framework where the high level frequency $p$ is a random variable, which is one of the contribution of this paper as described in Section 4.4. Note that the class of hierarchical policies we work with is more restrictive than others like the Option framework, where the timecommitment is also decided by the policy. Nevertheless, we show that this loss in policy expressivity acts as a regularizer and does not prevent our algorithm from surpassing other stateofthe art methods.
3 Related Work
There has been growing interest in HRL for the past few decades (sutton1999temporal; Precup2000TemporalLearning), but only recently has it been applied to highdimensional continuous domains as we do in this work (Kulkarni2016hrl; Daniel2016options). To obtain the lower level policies, or skills, most methods exploit some additional assumptions, like access to demonstrations (Le2018hirl; Merel2019humanoids; Ranchod2015skill; Sharma2018directedInfoGAIL), policy sketches (Andreas2017modular), or task decomposition into subtasks (Ghavamzadeh2003HPGA; sohn2018subtask). Other methods use a different reward for the lower level, often constraining it to be a “goal reacher” policy, where the signal from the higher level is the goal to reach (nachum2018hiro; levy2019HRLhindsight; vezhnevets2017fun). These methods are very promising for statereaching tasks, but might require access to goalreaching reward systems not defined in the original MDP, and are more limited when training on tasks beyond statereaching. Our method does not require any additional supervision, and the obtained skills are not constrained to be goalreaching.
When transferring skills to a new environment, most HRL methods keep them fixed and simply train a new higherlevel on top (hausman2018embedding; heess2016modulated). Other work allows for building on previous skills by constantly supplementing the set of skills with new ones (shu2019hrl), but they require a handdefined curriculum of tasks, and the previous skills are never finetuned. Our algorithm allows for seamless adaptation of the skills, showing no tradeoff between leveraging the power of the hierarchy and the final performance in a new task. Other methods use invertible functions as skills (haarnoja2018hierarchical), and therefore a fixed skill can be fully overwritten when a new layer of hierarchy is added on top. This kind of “finetuning” is promising, although similar to other works (peng2019composable), they do not apply it to temporally extended skills as we do here.
One of the most general frameworks to define temporally extended hierarchies is the options framework (sutton1999temporal), and it has recently been applied to continuous state spaces (bacon2017option). One of the most delicate parts of this formulation is the termination policy, and it requires several regularizers to avoid skill collapse (harb2017wait; Vezhnevets2016straw). This modification of the objective may be difficult to tune and affects the final performance. Instead of adding such penalties, we propose having skills of a random length, not controlled by the agent during training of the skills. The benefit is twofold: no termination policy to train, and more stable skills that transfer better. Furthermore, these works only used discrete action MDPs. We lift this assumption, and show good performance of our algorithm in complex locomotion tasks. There are other algorithms recently proposed that go in the same direction, but we found them more complex, less principled (their peraction marginalization cannot capture well the temporal correlation within each option), and without available code or evidence of outperforming nonhierarchical methods (smith2019inference).
The closest work to ours in terms of final algorithm structure is the one proposed by frans2018mlsh. Their method can be included in our framework, and hence benefits from our new theoretical insights. We introduce a modification that is shown to be highly beneficial: the random timecommitment mentioned above, and find that our method can learn in difficult environments without their complicated training scheme.
4 Efficient Hierarchical Policy Gradients
When using a hierarchical policy, the intermediate decision taken by the higher level is not directly applied in the environment. Therefore, technically it should not be incorporated into the trajectory description as an observed variable, like the actions. This makes the policy gradient considerably harder to compute. In this section we first prove that, under mild assumptions, the hierarchical policy gradient can be accurately approximated without needing to marginalize over this latent variable. Then, we derive an unbiased baseline for the policy gradient that can reduce the variance of its estimate. Finally, with these findings, we present our method, Hierarchical Proximal Policy Optimization (HiPPO), an onpolicy algorithm for hierarchical policies, allowing learning at all levels of the policy jointly and preventing subpolicy collapse.
4.1 Approximate Hierarchical Policy Gradient
Policy gradient algorithms are based on the likelihood ratio trick (williams1992reinforce) to estimate the gradient of returns with respect to the policy parameters as
${\nabla}_{\theta}\eta ({\pi}_{\theta})={\mathbb{E}}_{\tau}\left[{\nabla}_{\theta}\mathrm{log}P(\tau )R(\tau )\right]$  $\approx {\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{n}}{\nabla}_{\theta}\mathrm{log}P({\tau}_{i})R({\tau}_{i})$  (1)  
$={\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{n}}{\displaystyle \frac{1}{H}}{\displaystyle \sum _{t=1}^{H}}{\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}({a}_{t}{s}_{t})R({\tau}_{i})$  (2) 
In a temporal hierarchy, a hierarchical policy with a manager ${\pi}_{{\theta}_{h}}({z}_{t}{s}_{t})$ selects every $p$ timesteps one of $n$ subpolicies to execute. These subpolicies, indexed by $z\in {\mathbb{Z}}_{n}$, can be represented as a single conditional probability distribution over actions ${\pi}_{{\theta}_{l}}({a}_{t}{z}_{t},{s}_{t})$. This allows us to not only use a given set of subpolicies, but also leverage skills learned with Stochastic Neural Networks (SNNs) (florensa2017snn). Under this framework, the probability of a trajectory $\tau =({s}_{0},{a}_{0},{s}_{1},\mathrm{\dots},{s}_{H})$ can be written as
$P(\tau )=\left({\displaystyle \prod _{k=0}^{H/p}}\left[{\displaystyle \sum _{j=1}^{n}}{\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp}){\displaystyle \prod _{t=kp}^{(k+1)p1}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})\right]\right)\left[P({s}_{0}){\displaystyle \prod _{t=1}^{H}}P({s}_{t+1}{s}_{t},{a}_{t})\right].$  (3) 
The mixture action distribution, which presents itself as an additional summation over skills, prevents additive factorization when taking the logarithm, as from Eq. 1 to 2. This can yield numerical instabilities due to the product of the $p$ subpolicy probabilities. For instance, in the case where all the skills are distinguishable all the subpolicies’ probabilities but one will have small values, resulting in an exponentially small value. In the following Lemma, we derive an approximation of the policy gradient, whose error tends to zero as the skills become more diverse, and draw insights on the interplay of the manager actions.
Lemma 1.
If the skills are sufficiently differentiated, then the latent variable can be treated as part of the observation to compute the gradient of the trajectory probability. Let ${\pi}_{{\theta}_{h}}\mathit{}\mathrm{(}z\mathrm{}s\mathrm{)}$ and ${\pi}_{{\theta}_{l}}\mathit{}\mathrm{(}a\mathrm{}s\mathrm{,}z\mathrm{)}$ be Lipschitz functions w.r.t. their parameters, and assume that $$, then
${\nabla}_{\theta}\mathrm{log}P(\tau )={\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})+{\displaystyle \sum _{t=0}^{H}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})+\mathcal{O}(nH{\u03f5}^{p1})$  (4) 
Proof.
See Appendix. ∎
Our assumption can be seen as having diverse skills. Namely, for each action there is just one subpolicy that gives it high probability. In this case, the latent variable can be treated as part of the observation to compute the gradient of the trajectory probability. Many algorithms to extract lowerlevel skills are based on promoting diversity among the skills (florensa2017snn; Eysenbach2019diayn), therefore usually satisfying our assumption. We further analyze how well this assumption holds in our experiments section and Table 2.
4.2 Unbiased SubPolicy Baseline
The policy gradient estimate obtained when applying the loglikelihood ratio trick as derived above is known to have large variance. A very common approach to mitigate this issue without biasing the estimate is to subtract a baseline from the returns (Peters2008b). It is well known that such baselines can be made statedependent without incurring any bias. However, it is still unclear how to formulate a baseline for all the levels in a hierarchical policy, since an action dependent baseline does introduce bias in the gradient (tucker2018mirage). It has been recently proposed to use latentconditioned baselines (weber2019computation). Here we go further and prove that, under the assumptions of Lemma 1, we can formulate an unbiased latent dependent baseline for the approximate gradient (Eq. 5).
Lemma 2.
For any functions ${b}_{h}\mathrm{:}\mathrm{S}\mathrm{\to}\mathrm{R}$ and ${b}_{l}\mathrm{:}\mathrm{S}\mathrm{\times}\mathrm{Z}\mathrm{\to}\mathrm{R}$ we have:
$${\mathbb{E}}_{\tau}[\sum _{k=0}^{H/p}{\nabla}_{\theta}\mathrm{log}P({z}_{kp}{s}_{kp}){b}_{h}({s}_{kp})]=0\mathit{\text{and}}{\mathbb{E}}_{\tau}[\sum _{t=0}^{H}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp}){b}_{l}({s}_{t},{z}_{kp})]=0$$ 
Proof.
See Appendix. ∎
Now we apply Lemma 1 and Lemma 2 to Eq. 1. By using the corresponding value functions as the function baseline, the return can be replaced by the Advantage function $A({s}_{kp},{z}_{kp})$ (see details in schulman2016gae), and we obtain the following approximate policy gradient expression:
$\widehat{g}={\mathbb{E}}_{\tau}\left[({\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})A({s}_{kp},{z}_{kp}))+({\displaystyle \sum _{t=0}^{H}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})A({s}_{t},{a}_{t},{z}_{kp}))\right]$ 
This hierarchical policy gradient estimate can have lower variance than without baselines, but using it for policy optimization through stochastic gradient descent still yields an unstable algorithm. In the next section, we further improve the stability and sample efficiency of the policy optimization by incorporating techniques from Proximal Policy Optimization (schulman2017ppo).
4.3 Hierarchical Proximal Policy Optimization
Using an appropriate step size in policy space is critical for stable policy learning. Modifying the policy parameters in some directions may have a minimal impact on the distribution over actions, whereas small changes in other directions might change its behavior drastically and hurt training efficiency (kakade2002natural). Trust region policy optimization (TRPO) uses a constraint on the KLdivergence between the old policy and the new policy to prevent this issue (schulman2015trpo). Unfortunately, hierarchical policies are generally represented by complex distributions without closed form expressions for the KLdivergence. Therefore, to improve the stability of our hierarchical policy gradient we turn towards Proximal Policy Optimization (PPO) (schulman2017ppo). PPO is a more flexible and computeefficient algorithm. In a nutshell, it replaces the KLdivergence constraint with a cost function that achieves the same trust region benefits, but only requires the computation of the likelihood. Letting ${w}_{t}(\theta )=\frac{{\pi}_{\theta}({a}_{t}{s}_{t})}{{\pi}_{{\theta}_{old}}({a}_{t}{s}_{t})}$, the PPO objective is:
$${L}^{CLIP}(\theta )={\mathbb{E}}_{t}\mathrm{min}\{{w}_{t}(\theta ){A}_{t},\text{\U0001d68c\U0001d695\U0001d692\U0001d699}({w}_{t}(\theta ),1\u03f5,1+\u03f5){A}_{t}\}$$ 
We can adapt our approximated hierarchical policy gradient with the same approach by letting ${w}_{h,kp}(\theta )=\frac{{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})}{{\pi}_{{\theta}_{h,old}}({z}_{kp}{s}_{kp})}$ and ${w}_{l,t}(\theta )=\frac{{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})}{{\pi}_{{\theta}_{l,old}}({a}_{t}{s}_{t},{z}_{kp})}$, and using the superindex clip to denote the clipped objective version, we obtain the new surrogate objective:
${L}_{HiPPO}^{CLIP}(\theta )={\mathbb{E}}_{\tau}[$  $\sum _{k=0}^{H/p}}\mathrm{min}\{{w}_{h,kp}(\theta )A({s}_{kp},{z}_{kp}),{w}_{h,kp}^{\text{\U0001d68c\U0001d695\U0001d692\U0001d699}}(\theta )A({s}_{kp},{z}_{kp})\$  
$+$  $\sum _{t=0}^{H}}\mathrm{min}\{{w}_{l,t}(\theta )A({s}_{t},{a}_{t},{z}_{kp}),{w}_{l,t}^{\text{\U0001d68c\U0001d695\U0001d692\U0001d699}}(\theta )A({s}_{t},{a}_{t},{z}_{kp})\}]$ 
We call this algorithm Hierarchical Proximal Policy Optimization (HiPPO). Next, we introduce a critical additions: a switching of the timecommitment between skills.
4.4 Varying Timecommitment
Most hierarchical methods either consider a fixed timecommitment to the lower level skills (florensa2017snn; frans2018mlsh), or implement the complex options framework (Precup2000TemporalLearning; bacon2017option). In this work we propose an inbetween, where the timecommitment to the skills is a random variable sampled from a fixed distribution $\text{\U0001d672\U0001d68a\U0001d69d\U0001d68e\U0001d690\U0001d698\U0001d69b\U0001d692\U0001d68c\U0001d68a\U0001d695}({T}_{\mathrm{min}},{T}_{\mathrm{max}})$ just before the manager takes a decision. This modification does not hinder final performance, and we show it improves zeroshot adaptation to a new task. This approach to sampling rollouts is detailed in Algorithm 1. The full algorithm is detailed in Algorithm 2.
5 Experiments




We designed our experiments to answer the following questions: 1) How does HiPPO compare against a flat policy when learning from scratch? 2) Does it lead to policies more robust to environment changes? 3) How well does it adapt already learned skills? and 4) Does our skill diversity assumption hold in practice?
5.1 Tasks
We evaluate our approach on a variety of robotic locomotion and navigation tasks. The Block environments, depicted in Fig. 1(a)1(b), have walls of random heights at regular intervals, and the objective is to learn a gait for the Hopper and HalfCheetah robots to jump over them. The agents observe the height of the wall ahead and their proprioceptive information (joint positions and velocities), receiving a reward of +1 for each wall cleared. The Gather environments, described by duan2016benchmarking, require agents to collect apples (green balls, +1 reward) while avoiding bombs (red balls, 1 reward). The only available perception beyond proprioception is through a LIDARtype sensor indicating at what distance are the objects in different directions, and their type, as depicted in the bottom left corner of Fig. 1(c)1(d). This is challenging hierarchical task with sparse rewards that requires simultaneously learning perception, locomotion, and higherlevel planning capabilities. We use the Snake and Ant robots in Gather. Details for all robotic agents are provided in Appendix B.
5.2 Learning from Scratch and TimeCommitment
In this section, we study the benefit of using our HiPPO algorithm instead of standard PPO on a flat policy (schulman2017ppo). The results, reported in Figure 3, demonstrate that training from scratch with HiPPO leads to faster learning and better performance than flat PPO. Furthermore, we show that the benefit of HiPPO does not just come from having temporally correlated exploration: PPO with action repeat converges at a lower performance than our method. HiPPO leverages the timecommitment more efficiently, as suggested by the poor performance of the ablation where we set $p=1$, when the manager takes an action every environment step as well. Finally, Figure 4 shows the effectiveness of using the presented skilldependent baseline.















5.3 Comparison to Other Methods
We compare HiPPO to current stateoftheart hierarchical methods. First, we evaluate HIRO (nachum2018hiro), an offpolicy RL method based on training a goalreaching lower level policy. Fig. 5 shows that HIRO achieves poor performance on our tasks. As further detailed in Appendix D, this algorithm is sensitive to access to groundtruth information, like the exact $(x,y)$ position of the robot in Gather. In contrast, our method is able to perform well directly from the raw sensory inputs described in Section 5.1. We evaluate OptionCritic (bacon2017option), a variant of the options framework (sutton1999temporal) that can be used for continuous actionspaces. It fails to learn, and we hypothesize that their algorithm provides less timecorrelated exploration and learns less diverse skills. We also compare against MLSH (frans2018mlsh), which repeatedly samples new environment configurations to learn primitive skills. We take these hyperparameters from their Ant Twowalk experiment: resetting the environment configuration every 60 iterations, a warmup period of 20 during which only the manager is trained, and a joint training period of 40 during which both manager and skills are trained. Our results show that such a training scheme does not provide any benefits. Finally, we provide a comparison to a direct application of our Hierarchical Vanilla Policy Gradient (HierVPG) algorithm, and we see that the algorithm is unstable without PPO’s trustregionlike technique.
5.4 Robustness to Dynamics Perturbations
We investigate the robustness of HiPPO to changes in the dynamics of the environment. We perform several modifications to the base Snake Gather and Ant Gather environments. One at a time, we change the body mass, dampening of the joints, body inertia, and friction characteristics of both robots. The results, presented in Table 1, show that HiPPO with randomized period $\text{\U0001d672\U0001d68a\U0001d69d\U0001d68e\U0001d690\U0001d698\U0001d69b\U0001d692\U0001d68c\U0001d68a\U0001d695}([{T}_{\mathrm{min}},{T}_{\mathrm{max}}])$ is able to better handle these dynamics changes. In terms of the drop in policy performance between the training environment and test environment, it outperforms HiPPO with fixed period on 6 out of 8 related tasks. These results suggest that the randomized period exposes the policy to a wide range of scenarios, which makes it easier to adapt when the environment changes.
Gather  Algorithm  Initial  Mass  Dampening  Inertia  Friction 
Snake  Flat PPO  2.72  3.16 (+16%)  2.75 (+1%)  2.11 (22%)  2.75 (+1%) 
HiPPO, $p=10$  4.38  3.28 (25%)  3.27 (25%)  3.03 (31%)  3.27 (25%)  
HiPPO random $p$  5.11  4.09 (20%)  4.03 (21%)  3.21 (37%)  4.03 (21%)  
Ant  Flat PPO  2.25  2.53 (+12%)  2.13 (5%)  2.36 (+5%)  1.96 (13%) 
HiPPO, $p=10$  3.84  3.31 (14%)  3.37 (12%)  2.88 (25%)  3.07 (20%)  
HiPPO random $p$  3.22  3.37 (+5%)  2.57 (20%)  3.36 (+4%)  2.84 (12%) 
5.5 Adaptation of PreTrained Skills
For the Block task, we use DIAYN (Eysenbach2019diayn) to train 6 differentiated subpolicies in an environment without any walls. Here, we see if these diverse skills can improve performance on a downstream task that’s out of the training distribution. For Gather, we take 6 pretrained subpolicies encoded by a Stochastic Neural Network (tang2013sfnn) that was trained in a diversitypromoting environment (florensa2017snn). We finetune them with HiPPO on the Gather environment, but with an extra penalty on the velocity of the Center of Mass. This can be understood as a preference for cautious behavior. This requires adjustment of the subpolicies, which were trained with a proxy reward encouraging them to move as far as possible (and hence quickly). Fig. 6 shows that using HiPPO to simultaneously train a manager and finetune the skills achieves higher final performance than fixing the subpolicies and only training a manager with PPO. The two initially learn at the same rate, but HiPPO’s ability to adjust to the new dynamics allows it to reach a higher final performance. Fig. 6 also shows that HiPPO can finetune the same given skills better than OptionCritic (bacon2017option), MLSH (frans2018mlsh), and HIRO (nachum2018hiro).





5.6 Skill Diversity Assumption
In Lemma 1, we derived a more efficient and numerically stable gradient by assuming that the subpolicies are diverse. In this section, we empirically test the validity of our assumption and the quality of our approximation. We run the HiPPO algorithm on Ant Gather and Snake Gather both from scratch and with given pretrained skills, as done in the previous section. In Table 2, we report the average maximum probability under other subpolicies, corresponding to $\u03f5$ from the assumption. In all settings, this is on the order of magnitude of 0.1. Therefore, under the $p\approx 10$ that we use in our experiments, the term we neglect has a factor ${\u03f5}^{p1}={10}^{10}$. It is not surprising then that the average cosine similarity between the full gradient and our approximation is almost 1, as reported in Table 2.
Gather  Algorithm  Cosine Sim.  ${\mathrm{max}}_{{z}^{\prime}\ne {z}_{kp}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}^{\prime})$  ${\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})$ 
Snake  HiPPO on given skills  $0.98\pm 0.01$  $0.09\pm 0.04$  $0.44\pm 0.03$ 
HiPPO on random skills  $0.97\pm 0.03$  $0.12\pm 0.03$  $0.32\pm 0.04$  
Ant  HiPPO on given skills  $0.96\pm 0.04$  $0.11\pm 0.05$  $0.40\pm 0.08$ 
HiPPO on random skills  $0.94\pm 0.03$  $0.13\pm 0.05$  $0.31\pm 0.09$ 
6 Conclusions and Future Work
In this paper, we examined how to effectively adapt temporal hierarchies. We began by deriving a hierarchical policy gradient and approximation of it. We then proposed a new method, HiPPO, that can stably train multiple layers of a hierarchy jointly. The adaptation experiments suggest that we can optimize pretrained skills for downstream environments, and learn emergent skills without any unsupervised pretraining. We also demonstrate that HiPPO with randomized period can learn from scratch on sparsereward and long time horizon tasks, while outperforming nonhierarchical methods on zeroshot transfer.
References
Appendix A Hyperparameters and Architectures
The Block environments used a horizon of 1000 and a batch size of 50,000, while Gather used a batch size of 100,000. Ant Gather has a horizon of 5000, while Snake Gather has a horizon of 8000 due to its larger size. For all experiments, both PPO and HiPPO used learning rate $3\times {10}^{3}$, clipping parameter $\u03f5=0.1$, 10 gradient updates per iteration, and discount $\gamma =0.999$. The learning rate, clipping parameter, and number of gradient updates come from the OpenAI Baselines implementation.
HiPPO used $n=6$ subpolicies. HiPPO uses a manager network with 2 hidden layers of 32 units, and a skill network with 2 hidden layers of 64 units. In order to have roughly the same number of parameters for each algorithm, flat PPO uses a network with 2 hidden layers with 256 and 64 units respectively. For HiPPO with randomized period, we resample $p\sim \text{Uniform}\{5,15\}$ every time the manager network outputs a latent, and provide the number of timesteps until the next latent selection as an input into both the manager and skill networks. The single baselines and skilldependent baselines used a MLP with 2 hidden layers of 32 units to fit the value function. The skilldependent baseline receives, in addition to the full observation, the active latent code and the time remaining until the next skill sampling. All runs used five random seeds.
Appendix B Robot Agent Description
Hopper is a 3link robot with a 14dimensional observation space and a 3dimensional action space. HalfCheetah has a 20dimensional observation space and a 6dimensional action space. We evaluate both of these agents on a sparse block hopping task. In addition to observing their own joint angles and positions, they observe the height and length of the next wall, the xposition of the next wall, and the distance to the wall from the agent. We also provide the same wall observations for the previous wall, which the agent can still interact with.
Snake is a 5link robot with a 17dimensional observation space and a 4dimensional action space. Ant is a quadrupedal robot with a 27dimensional observation space and a 8dimensional action space. Both Ant and Snake can move and rotate in all directions, and Ant faces the added challenge of avoiding falling over irrecoverably. In the Gather environment, agents also receive 2 sets of 10dimensional lidar observations, whcih correspond to separate apple and bomb observations. The observation displays the distance to the nearest apple or bomb in each ${36}^{\circ}$ bin, respectively. All environments are simulated with the physics engine MuJoCo (Todorov2012MuJoCoControl).
Appendix C Proofs
Lemma 1. If the skills are sufficiently differentiated, then the latent variable can be treated as part of the observation to compute the gradient of the trajectory probability. Concretely, if ${\pi}_{{\theta}_{h}}(zs)$ and ${\pi}_{{\theta}_{l}}(as,z)$ are Lipschitz in their parameters, and $$, then
${\nabla}_{\theta}\mathrm{log}P(\tau )={\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})+{\displaystyle \sum _{t=1}^{p}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})+\mathcal{O}(nH{\u03f5}^{p1})$  (5) 
Proof.
From the point of view of the MDP, a trajectory is a sequence $\tau =({s}_{0},{a}_{0},{s}_{1},{a}_{1},\mathrm{\dots},{a}_{H1},{s}_{H})$. Let’s assume we use the hierarchical policy introduced above, with a higherlevel policy modeled as a parameterized discrete distribution with $n$ possible outcomes ${\pi}_{{\theta}_{h}}(zs)=Categorica{l}_{{\theta}_{h}}(n)$. We can expand $P(\tau )$ into the product of policy and environment dynamics terms, with ${z}_{j}$ denoting the $j$th possible value out of the $n$ choices,
$$P(\tau )=\left(\prod _{k=0}^{H/p}\left[\sum _{j=1}^{n}{\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp})\prod _{t=kp}^{(k+1)p1}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})\right]\right)\left[P({s}_{0})\prod _{t=1}^{H}P({s}_{t+1}{s}_{t},{a}_{t})\right]$$ 
Taking the gradient of $\mathrm{log}P(\tau )$ with respect to the policy parameters $\theta =[{\theta}_{h},{\theta}_{l}]$, the dynamics terms disappear, leaving:
${\nabla}_{\theta}\mathrm{log}P(\tau )$  $={\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}\left({\displaystyle \sum _{j=1}^{n}}{\pi}_{{\theta}_{l}}({z}_{j}{s}_{kp}){\displaystyle \prod _{t=kp}^{(k+1)p1}}{\pi}_{s,\theta}({a}_{t}{s}_{t},{z}_{j})\right)$  
$={\displaystyle \sum _{k=0}^{H/p}}{\displaystyle \frac{1}{{\sum}_{j=1}^{n}{\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp}){\prod}_{t=kp}^{(k+1)p1}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})}}{\displaystyle \sum _{j=1}^{n}}{\nabla}_{\theta}\left({\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp}){\displaystyle \prod _{t=kp}^{(k+1)p1}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})\right)$ 
The sum over possible values of $z$ prevents the logarithm from splitting the product over the $p$step subtrajectories. This term is problematic, as this product quickly approaches $0$ as $p$ increases, and suffers from considerable numerical instabilities. Instead, we want to approximate this sum of products by a single one of the terms, which can then be decomposed into a sum of logs. For this we study each of the terms in the sum: the gradient of a subtrajectory probability under a specific latent ${\nabla}_{\theta}\left({\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp}){\prod}_{t=kp}^{(k+1)p1}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})\right)$. Now we can use the assumption that the skills are easy to distinguish, $$. Therefore, the probability of the subtrajectory under a latent different than the one that was originally sampled ${z}_{j}\ne {z}_{kp}$, is upper bounded by ${\u03f5}^{p}$. Taking the gradient, applying the product rule, and the Lipschitz continuity of the policies, we obtain that for all ${z}_{j}\ne {z}_{kp}$,
${\nabla}_{\theta}\left({\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp}){\displaystyle \prod _{t=kp}^{(k+1)p1}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})\right)$  $={\nabla}_{\theta}{\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp}){\displaystyle \prod _{t=kp}^{(k+1)p1}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})+$  
$\sum _{t=kp}^{(k+1)p1}}{\pi}_{{\theta}_{h}}({z}_{j}{s}_{kp})\left({\nabla}_{\theta}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{j})\right){\displaystyle \prod _{\begin{array}{c}t=kp\\ {t}^{\prime}\ne t\end{array}}^{(k+1)p1}}{\pi}_{{\theta}_{l}}({a}_{{t}^{\prime}}{s}_{{t}^{\prime}},{z}_{j})$  
$=\mathcal{O}(p{\u03f5}^{p1})$ 
Thus, we can across the board replace the summation over latents by the single term corresponding to the latent that was sampled at that time.
${\nabla}_{\theta}\mathrm{log}P(\tau )$  $={\displaystyle \sum _{k=0}^{H/p}}{\displaystyle \frac{1}{{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp}){\prod}_{t=kp}^{(k+1)p1}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})}}{\nabla}_{\theta}\left(P({z}_{kp}{s}_{kp}){\displaystyle \prod _{t=kp}^{(k+1)p1}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})\right)+{\displaystyle \frac{nH}{p}}\mathcal{O}(p{\u03f5}^{p1})$  
$={\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}\left({\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp}){\displaystyle \prod _{t=kp}^{(k+1)p1}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})\right)+\mathcal{O}(nH{\u03f5}^{p1})$  
$={\mathbb{E}}_{\tau}\left[\left({\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})+{\displaystyle \sum _{t=1}^{H}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})\right)\right]+\mathcal{O}(nH{\u03f5}^{p1})$ 
Interestingly, this is exactly ${\nabla}_{\theta}P({s}_{0},{z}_{0},{a}_{0},{s}_{1},\mathrm{\dots})$. In other words, it’s the gradient of the probability of that trajectory, where the trajectory now includes the variables $z$ as if they were observed.
∎
Lemma 2. For any functions ${b}_{h}:\mathcal{S}\to \mathbb{R}$ and ${b}_{l}:\mathcal{S}\times \mathcal{Z}\to \mathbb{R}$ we have:
$${\mathbb{E}}_{\tau}[\sum _{k=0}^{H/p}{\nabla}_{\theta}\mathrm{log}P({z}_{kp}{s}_{kp})b({s}_{kp})]=0$$ 
$${\mathbb{E}}_{\tau}[\sum _{t=0}^{H}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})b({s}_{t},{z}_{kp})]=0$$ 
Proof.
We can use the tower property as well as the fact that the interior expression only depends on ${s}_{kp}$ and ${z}_{kp}$:
${\mathbb{E}}_{\tau}[{\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}P({z}_{kp}{s}_{kp})b({s}_{kp})]$  $={\displaystyle \sum _{k=0}^{H/p}}{\mathbb{E}}_{{s}_{kp},{z}_{kp}}[{\mathbb{E}}_{\tau \setminus {s}_{kp},{z}_{kp}}[{\nabla}_{\theta}\mathrm{log}P({z}_{kp}{s}_{kp})b({s}_{kp})]]$  
$={\displaystyle \sum _{k=0}^{H/p}}{\mathbb{E}}_{{s}_{kp},{z}_{kp}}[{\nabla}_{\theta}\mathrm{log}P({z}_{kp}{s}_{kp})b({s}_{kp})]$ 
Then, we can write out the definition of the expectation and undo the gradientlog trick to prove that the baseline is unbiased.
${\mathbb{E}}_{\tau}[{\displaystyle \sum _{k=0}^{H/p}}{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})b({s}_{kp})]$  $={\displaystyle \sum _{k=0}^{H/p}}{\displaystyle {\int}_{({s}_{kp},{z}_{kp})}}P({s}_{kp},{z}_{kp}){\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})b({s}_{kp})\mathit{d}{z}_{kp}\mathit{d}{s}_{kp}$  
$={\displaystyle \sum _{k=0}^{H/p}}{\displaystyle {\int}_{{s}_{kp}}}P({s}_{kp})b({s}_{kp}){\displaystyle {\int}_{{z}_{kp}}}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp}){\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})\mathit{d}{z}_{kp}\mathit{d}{s}_{kp}$  
$={\displaystyle \sum _{k=0}^{H/p}}{\displaystyle {\int}_{{s}_{kp}}}P({s}_{kp})b({s}_{kp}){\displaystyle {\int}_{{z}_{kp}}}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp}){\displaystyle \frac{1}{{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})}}{\nabla}_{\theta}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})\mathit{d}{z}_{kp}\mathit{d}{s}_{kp}$  
$={\displaystyle \sum _{k=0}^{H/p}}{\displaystyle {\int}_{{s}_{kp}}}P({s}_{kp})b({s}_{kp}){\nabla}_{\theta}{\displaystyle {\int}_{{z}_{kp}}}{\pi}_{{\theta}_{h}}({z}_{kp}{s}_{kp})\mathit{d}{z}_{kp}\mathit{d}{s}_{kp}$  
$={\displaystyle \sum _{k=0}^{H/p}}{\displaystyle {\int}_{{s}_{kp}}}P({s}_{kp})b({s}_{kp}){\nabla}_{\theta}1d{s}_{kp}$  
$=0$ 
∎
Subtracting a state and subpolicy dependent baseline from the second term is also unbiased, i.e.
$${\mathbb{E}}_{\tau}[\sum _{t=0}^{H}{\nabla}_{\theta}\mathrm{log}{\pi}_{s,\theta}({a}_{t}{s}_{t},{z}_{kp})b({s}_{t},{z}_{kp})]=0$$ 
We’ll follow the same strategy to prove the second equality: apply the tower property, express the expectation as an integral, and undo the gradientlog trick.
${\mathbb{E}}_{\tau}[{\displaystyle \sum _{t=0}^{H}}$  ${\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})b({s}_{t},{z}_{kp})]$  
$={\displaystyle \sum _{t=0}^{H}}{\mathbb{E}}_{{s}_{t},{a}_{t},{z}_{kp}}[{\mathbb{E}}_{\tau \setminus {s}_{t},{a}_{t},{z}_{kp}}[{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{m}}({a}_{t}{s}_{t},{z}_{kp})b({s}_{t},{z}_{kp})]]$  
$={\displaystyle \sum _{t=0}^{H}}{\mathbb{E}}_{{s}_{t},{a}_{t},{z}_{kp}}[{\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})b({s}_{kp},{z}_{kp})]$  
$={\displaystyle \sum _{t=0}^{H}}{\displaystyle {\int}_{({s}_{t},{z}_{kp})}}P({s}_{t},{z}_{kp})b({s}_{t},{z}_{kp}){\displaystyle {\int}_{{a}_{t}}}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp}){\nabla}_{\theta}\mathrm{log}{\pi}_{{\theta}_{l}}({a}_{t}{s}_{t},{z}_{kp})\mathit{d}{a}_{t}\mathit{d}{z}_{kp}\mathit{d}{s}_{t}$  
$={\displaystyle \sum _{t=0}^{H}}{\displaystyle {\int}_{({s}_{t},{z}_{kp})}}P({s}_{t},{z}_{kp})b({s}_{t},{z}_{kp}){\nabla}_{\theta}1d{z}_{kp}d{s}_{t}$  
$=0$ 
Appendix D HIRO sensitivity to observationspace
In this section we provide a more detailed explanation of why HIRO (nachum2018hiro) performs poorly under our environments. As explained in our related work section, HIRO belongs to the general category of algorithms that train goalreaching policies as lower levels of the hierarchy (vezhnevets2017fun; levy2017hac). These methods rely on having a goalspace that is meaningful for the task at hand. For example, in navigation tasks they require having access to the $(x,y)$ position of the agent such that deltas in that space can be given as meaningful goals to move in the environment. Unfortunately, in many cases the only readily available information (if there’s no GPS signal or other positioning system installed) are raw sensory inputs, like cameras or the LIDAR sensors we mimic in our environments. In such cases, our method still performs well because it doesn’t rely on the goalreaching extra supervision that is leveraged (and detrimental in this case) in HIRO and similar methods. In Figure 7, we show that knowing the ground truth location is critical for its success. We have reproduced the HIRO results in Fig. 7 using the published codebase, so we are convinced that our results showcase a failure mode of HIRO.
Appendix E Hyperparameter Sensitivity Plots



