Dynamics-Aware Unsupervised Discovery of Skills
Abstract
Conventionally, model-based reinforcement learning (MBRL) aims to learn a global model for the dynamics of the environment. A good model can potentially enable planning algorithms to generate a large variety of behaviors and solve diverse tasks. However, learning an accurate model for complex dynamical systems is difficult, and even then, the model might not generalize well outside the distribution of states on which it was trained. In this work, we combine model-based learning with model-free learning of primitives that make model-based planning easy. To that end, we aim to answer the question: how can we discover skills whose outcomes are easy to predict? We propose an unsupervised learning algorithm, Dynamics-Aware Discovery of Skills (DADS), which simultaneously discovers predictable behaviors and learns their dynamics. Our method can leverage continuous skill spaces, theoretically allowing us to learn infinitely many behaviors even for high-dimensional state-spaces. We demonstrate that zero-shot planning in the learned latent space significantly outperforms standard MBRL and model-free goal-conditioned RL, can handle sparse-reward tasks, and substantially improves over prior hierarchical RL methods for unsupervised skill discovery. Video demonstrations of our results are available at: https://sites.google.com/view/dadsskill
Archit Sharma†, Shixiang Gu, Sergey Levine, Vikash Kumar, Karol Hausman
Google Brain
{architsh, shanegu, slevine, vikashplus, karolhausman}@google.com
† Work done as a member of the Google AI Residency Program (g.co/airesidency).
Preprint. Under review.
1 Introduction
Deep reinforcement learning (RL) enables autonomous learning of diverse and complex tasks with rich sensory inputs, temporally extended goals, and challenging dynamics, such as discrete game-playing domains (Mnih et al., 2013; Silver et al., 2016), and continuous control domains including locomotion (Schulman et al., 2015; Heess et al., 2017) and manipulation (Rajeswaran et al., 2017; Kalashnikov et al., 2018; Gu et al., 2017). Most deep RL approaches learn a Q-function or a policy that is directly optimized for the training task, which limits their generalization to new scenarios. In contrast, MBRL methods (Li and Todorov, 2004; Deisenroth and Rasmussen, 2011; Watter et al., 2015) can acquire dynamics models that may be utilized to perform unseen tasks at test time. While this capability has been demonstrated in some recent works (Levine et al., 2016; Nagabandi et al., 2018; Chua et al., 2018b; Kurutach et al., 2018; Ha and Schmidhuber, 2018), learning an accurate global model that works for all state-action pairs can be exceedingly challenging, especially for high-dimensional systems with complex and discontinuous dynamics. The problem is further exacerbated as the learned global model has limited generalization outside of the state distribution it was trained on, and exploring the whole state space is generally infeasible. Can we retain the flexibility of model-based RL, while using model-free RL to acquire proficient low-level behaviors under complex dynamics?
While learning a global dynamics model that captures all the different behaviors for the entire state-space can be extremely challenging, learning a model for a specific behavior that acts only in a small part of the state-space can be much easier. For example, consider learning a model for the dynamics of all gaits of a quadruped versus a model which only works for a specific gait. If we can learn many such behaviors and their corresponding dynamics, we can leverage model-predictive control to plan in the behavior space, as opposed to planning in the action space. The question then becomes: how do we acquire such behaviors, considering that behaviors could be random and unpredictable? To this end, we propose Dynamics-Aware Discovery of Skills (DADS), an unsupervised RL framework for learning low-level skills using model-free RL with the explicit aim of making model-based control easy. Skills obtained using DADS are directly optimized for predictability, providing a better representation on top of which predictive models can be learned. Crucially, the skills do not require any supervision to learn, and are acquired entirely through autonomous exploration. This means that the repertoire of skills and their predictive model are learned before the agent has been tasked with any goal or reward function. When a task is provided at test time, the agent utilizes the previously learned skills and model to immediately perform the task without any further training.
The key contribution of our work is an unsupervised reinforcement learning algorithm, DADS, grounded in mutual-information-based exploration. We demonstrate that our objective can embed learned primitives in continuous spaces, which allows us to learn a large, diverse set of skills. Crucially, our algorithm also learns to model the dynamics of the skills, which enables the use of model-based planning algorithms for downstream tasks. We adapt conventional model-predictive control algorithms to plan in the space of primitives, and demonstrate that we can compose the learned primitives to solve downstream tasks without any additional training.
2 Preliminaries
Mutual information has been used as an objective to encourage exploration in reinforcement learning (Houthooft et al., 2016; Mohamed and Rezende, 2015). According to its definition, $\mathcal{I}(A;B)=\mathscr{H}(A)-\mathscr{H}(A\mid B)$, optimizing mutual information $\mathcal{I}$ with respect to $B$ amounts to maximizing the entropy $\mathscr{H}$ of $A$ while minimizing the conditional entropy $\mathscr{H}(A\mid B)$. If $A$ is a function of the state and $B$ represents actions, this objective encourages the state entropy to be high, causing the underlying policy to be exploratory. Recently, multiple works (Eysenbach et al., 2018; Gregor et al., 2016; Achiam et al., 2018) have applied this idea to learn diverse skills which maximally cover the state space.
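As a small illustration of the definition of mutual information as entropy minus conditional entropy, the sketch below evaluates it for discrete joint distributions. The tables `independent` and `deterministic` are toy examples chosen for this illustration and do not come from the paper; the paper itself works with continuous states and skills.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution; zero entries are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(A;B) = H(A) - H(A|B) for a joint table with rows indexed by A and columns by B."""
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1)
    p_b = joint.sum(axis=0)
    # Conditional entropy via the identity H(A|B) = H(A,B) - H(B).
    h_a_given_b = entropy(joint.ravel()) - entropy(p_b)
    return entropy(p_a) - h_a_given_b

# Toy joint distributions (assumptions for illustration only).
independent = np.outer([0.5, 0.5], [0.5, 0.5])       # B says nothing about A
deterministic = np.array([[0.5, 0.0], [0.0, 0.5]])   # B fully determines A
```

For the independent table the mutual information vanishes, while for the deterministic table it equals the full entropy of $A$, which is $\log 2$ nats here.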
To leverage planning-based control, MBRL estimates the true dynamics of the environment by learning a model $\widehat{p}(s^{\prime}\mid s,a)$. This allows it to predict a trajectory of states $\widehat{\tau}_{H}=(s_{t},\widehat{s}_{t+1},\dots,\widehat{s}_{t+H})$ resulting from a sequence of actions without any additional interaction with the environment. A similar simulation of the trajectory $\widehat{\tau}_{H}$ can be carried out using a model parameterized as $q(s^{\prime}\mid s,z)$, where $z$ denotes the skill that is being executed. This modification to MBRL not only mandates the existence of a policy $\pi(\cdot\mid s,z)$ executing the actual actions in the environment, but more importantly, requires the policy to execute these actions in a way that maintains predictability under $q$. In this setup, skills $z$ are effectively an abstraction for the actions $a_{1},a_{2},\dots$ that are executed in the environment. This scheme forgoes the much harder task of learning a global model $\widehat{p}$, in exchange for a collection of potentially simpler models of behavior-specific dynamics. In addition, the planning problem becomes easier as the planner searches over skills $z$ that can act on longer horizons than granular actions $a$.
These seemingly unrelated ideas can be combined into a single optimization scheme, where we first discover skills (and their models) without any extrinsic reward and then compose these skills to optimize for the task defined at test time using model-based planning. At train time, we assume a Markov Decision Process (MDP) $\mathcal{M}_{1}\equiv(\mathcal{S},\mathcal{A},p)$. The state space $\mathcal{S}$ and action space $\mathcal{A}$ are assumed to be continuous, and $\mathcal{A}$ is assumed to be bounded. We assume the transition dynamics $p$ to be stochastic, such that $p:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,\infty)$. We learn a skill-conditioned policy $\pi(a\mid s,z)$, where the skill $z$ belongs to the space $\mathcal{Z}$, detailed in Section 3. We assume that the skills are sampled from a prior $p(z)$ over $\mathcal{Z}$. We simultaneously learn a skill-conditioned transition function $q(s^{\prime}\mid s,z)$, coined as skill-dynamics, which predicts the transition to the next state $s^{\prime}$ from the current state $s$ for the skill $z$ under the given dynamics $p$. At test time, we assume an MDP $\mathcal{M}_{2}\equiv(\mathcal{S},\mathcal{A},p,r)$, where $\mathcal{S},\mathcal{A},p$ match those defined in $\mathcal{M}_{1}$, and the reward function is $r:\mathcal{S}\times\mathcal{A}\mapsto(-\infty,\infty)$. We plan in $\mathcal{Z}$ using $q(s^{\prime}\mid s,z)$ to compose the learned skills $z$ for optimizing $r$ in $\mathcal{M}_{2}$, which we detail in Section 4.
3 Dynamics-Aware Discovery of Skills (DADS)
We now establish a connection between mutual-information-based exploration and model-based RL by deriving an intrinsic reward that reflects predictability under skill-dynamics. For an episodic setting with horizon $T$, we aim to maximize:
$\mathcal{I}(s_{1},\dots,s_{T};z)\quad \text{s.t.}\quad \sum_{t=1}^{T-1}\mathcal{I}(a_{t};\{s_{t},z\})\le I_{c},$  (1)
for an arbitrary constant upper bound $I_{c}$. The proposed objective encodes the intuition that every skill should be maximally informative about the resulting sequence of states $s_{1},\dots,s_{T}$ in the MDP $\mathcal{M}_{1}$, while being minimally informative about the sequence of actions used. For clarity of discussion, we defer a more rigorous justification for this information-bottleneck-style (Tishby et al., 2000; Alemi et al., 2016) objective to Appendix B. We simplify Eq. 1:
$\mathcal{I}(s_{1},\dots,s_{T};z)$  $=\mathcal{I}(s_{1};z)+\mathcal{I}(s_{2};z\mid s_{1})+\dots+\mathcal{I}(s_{T};z\mid s_{T-1},\dots,s_{1})$  (2)
$=\mathcal{I}(s_{1};z)+\mathcal{I}(s_{2};z\mid s_{1})+\dots+\mathcal{I}(s_{T};z\mid s_{T-1})$  (3)
by using the chain rule of mutual information to obtain Eq. 2, and the Markovian assumption of $\mathcal{M}_{1}$ to obtain Eq. 3. Returning to Eq. 1, we obtain our objective $R(\pi)$ as:
$R(\pi)=\sum_{t=1}^{T-1}\mathcal{I}(s_{t+1};z\mid s_{t})-\beta\,\mathcal{I}(a_{t};\{s_{t},z\})$  (4)
where we formulate the dual objective using the Lagrangian multiplier $\beta $ (and ignore the constant $\beta {I}_{c}$). Using the definition of mutual information, the resulting objective is given by:
$R(\pi)$  $=\sum_{t=1}^{T-1}\mathbb{E}_{\rho_{z}(s_{t},a_{t}),z}\left[\log\frac{p(s_{t+1}\mid s_{t},z)}{p(s_{t+1}\mid s_{t})}-\beta\log\frac{\pi(a_{t}\mid s_{t},z)}{\pi(a_{t})}\right]$  (5)
$\ge\sum_{t=1}^{T-1}\mathbb{E}_{\rho_{z}(s_{t},a_{t}),z}\left[\log\frac{q_{\varphi}(s_{t+1}\mid s_{t},z)}{p(s_{t+1}\mid s_{t})}-\beta\log\frac{\pi(a_{t}\mid s_{t},z)}{p(a)}\right]$  (6)
$R(\pi,q_{\varphi})$  $\equiv\sum_{t=1}^{T-1}\mathbb{E}_{\rho_{z}(s_{t},a_{t}),z}\left[\log q_{\varphi}(s_{t+1}\mid s_{t},z)-\log p(s_{t+1}\mid s_{t})-\beta\log\pi(a_{t}\mid s_{t},z)\right]$  (7)
where $\rho_{z}(s_{t},a_{t})$ represents the stationary state-action distribution under the skill $z$. For Eq. 6, we use the non-negativity of KL-divergence, that is $\int\pi(a_{t})\log\frac{\pi(a_{t})}{p(a_{t})}da_{t}\ge 0$, to replace the marginal over the policy $\pi(a_{t})$ with the uniform prior $p(a)$ over the bounded action space $\mathcal{A}$. Similarly, we use $\int p(s_{t+1}\mid s_{t},z)\log\frac{p(s_{t+1}\mid s_{t},z)}{q_{\varphi}(s_{t+1}\mid s_{t},z)}ds_{t+1}\ge 0$ to introduce skill-dynamics as a parametric variational approximation $q_{\varphi}(s^{\prime}\mid s,z)$. Ignoring the constant $\beta\log p(a)$, we get our objective $R(\pi,q_{\varphi})$ in Eq. 7.
Maximizing $R(\pi ,{q}_{\varphi})$ immediately suggests an alternating optimization scheme that is summarized in Figure 2. Note that the gradient for $\varphi $ can be expressed as:
$\nabla_{\varphi}R(\pi,q_{\varphi})=\sum_{t=1}^{T-1}\mathbb{E}_{\rho_{z}(s_{t},a_{t}),z}\left[\nabla_{\varphi}\log q_{\varphi}(s_{t+1}\mid s_{t},z)\right],$  (8)
which is simply maximizing the likelihood of the transitions generated by the current policy.
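To make the maximum-likelihood reading of the skill-dynamics gradient concrete, the following sketch runs gradient ascent on the log-likelihood of a toy model. The linear-Gaussian form $q(s^{\prime}\mid s,z)=\mathcal{N}(s+Wz,I)$, the synthetic transitions, and the step size are all assumptions made for this illustration; the paper does not specify the model class in this section.

```python
import numpy as np

def mean_log_likelihood(W, s, z, s_next):
    """Average log q(s'|s,z) for q = N(s + W z, I), dropping the additive constant."""
    residual = s_next - (s + z @ W.T)
    return float(-0.5 * np.mean(np.sum(residual ** 2, axis=1)))

def likelihood_gradient(W, s, z, s_next):
    """Gradient of the average log-likelihood above with respect to W."""
    residual = s_next - (s + z @ W.T)          # shape (N, dim_s)
    return residual.T @ z / len(s)             # shape (dim_s, dim_z)

# Synthetic transitions standing in for data collected by the current policy.
rng = np.random.default_rng(0)
true_W = np.array([[1.0, 0.0], [0.0, -2.0]])
z = rng.uniform(-1.0, 1.0, size=(256, 2))
s = rng.normal(size=(256, 2))
s_next = s + z @ true_W.T + 0.01 * rng.normal(size=(256, 2))

W = np.zeros((2, 2))
for _ in range(500):
    # Gradient ascent on the log-likelihood of the observed transitions.
    W += 0.5 * likelihood_gradient(W, s, z, s_next)
```

With a fixed-variance Gaussian, the ascent direction is the same as that of a least-squares regression of the state change on the skill, so `W` approaches `true_W` up to the injected noise.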
The optimization of the policy $\pi$ can be interpreted as entropy-regularized RL with a reward function $\log q_{\varphi}(s_{t+1}\mid s_{t},z)-\log p(s_{t+1}\mid s_{t})$. Unfortunately, $\log p(s_{t+1}\mid s_{t})$ is intractable to compute, so we need to resort to approximations. We choose to reuse the skill-dynamics model to approximate $p(s_{t+1}\mid s_{t})=\int p(s_{t+1}\mid s_{t},z)p(z)dz\approx\frac{1}{L}\sum_{i=1}^{L}p(s_{t+1}\mid s_{t},z_{i})\approx\frac{1}{L}\sum_{i=1}^{L}q_{\varphi}(s_{t+1}\mid s_{t},z_{i})$, where $z_{i}$ is sampled from the prior $p(z)$. The final reward function can be written as:
$r_{z}(s_{t},a_{t},s_{t+1})=\log\frac{q_{\varphi}(s_{t+1}\mid s_{t},z)}{\sum_{i=1}^{L}q_{\varphi}(s_{t+1}\mid s_{t},z_{i})}+\log L,\quad z_{i}\sim p(z).$  (9)
In practice, we often reuse the sample $z$ in the denominator in Eq. 9 amongst the samples $\{z_{i}\}_{i=1}^{L}$, to obtain a softmax-like construction, providing a smoother optimization landscape. For the actual algorithm, we collect a large on-policy batch of data in every iteration, so that it contains experience collected from different skills. In order to take multiple gradient steps on the same batch of data, we use soft actor-critic (Haarnoja et al., 2018a, b) as the optimization algorithm for the policy $\pi$ (although our method is agnostic to the choice of the RL algorithm used to update the policy). The exact implementation details are discussed in Appendix A.
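The intrinsic reward of Eq. 9 can be sketched numerically as follows. This is only an illustration: the unit-variance Gaussian density, the hypothetical `toy_skill_dynamics_mean` stand-in for the learned skill-dynamics, and the hand-picked skills are assumptions, not the authors' implementation. The executed skill is included among the denominator samples, as described above, and the sum is evaluated with a log-sum-exp for numerical stability.

```python
import numpy as np

def gaussian_log_prob(x, mean, std=1.0):
    """Log-density of an isotropic Gaussian, summed over state dimensions."""
    x, mean = np.asarray(x, dtype=float), np.asarray(mean, dtype=float)
    d = x.shape[-1]
    return (-0.5 * np.sum(((x - mean) / std) ** 2, axis=-1)
            - d * np.log(std) - 0.5 * d * np.log(2.0 * np.pi))

def toy_skill_dynamics_mean(s, z):
    """Hypothetical stand-in for the learned model: the skill is the predicted displacement."""
    return s + z

def intrinsic_reward(s, z, s_next, prior_samples):
    """Eq. 9 with the executed skill z reused among the L denominator samples."""
    z_all = np.vstack([z[None, :], prior_samples])                        # shape (L, dim_z)
    log_q = gaussian_log_prob(s_next, toy_skill_dynamics_mean(s, z_all))  # shape (L,)
    # Numerically stable log of the denominator sum.
    m = np.max(log_q)
    log_denominator = m + np.log(np.sum(np.exp(log_q - m)))
    return float(log_q[0] - log_denominator + np.log(len(z_all)))
```

Under this construction the reward is at most $\log L$, it is zero when all sampled skills predict the transition equally well, and it becomes negative when other skills explain the observed transition better than the executed one.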
4 Planning using Skill Dynamics
Given the learned skills $\pi(a\mid s,z)$ and their respective skill-transition dynamics $q_{\varphi}(s^{\prime}\mid s,z)$, we can perform model-based planning in the latent space $\mathcal{Z}$ to optimize for a reward $r$ that is given to an agent at test time. Note that this essentially allows us to perform zero-shot planning given the unsupervised pre-training procedure described in Section 3.
In order to perform planning, we employ the model-predictive-control (MPC) paradigm (Garcia et al., 1989), which in a standard setting generates a set of action plans $P_{k}=(a_{k,1},\dots,a_{k,H})\sim P$ for a planning horizon $H$. The MPC plans can be generated because the planner is able to simulate the trajectory $\widehat{\tau}_{k}=(s_{k,1},a_{k,1},\dots,s_{k,H+1})$ assuming access to the transition dynamics $\widehat{p}(s^{\prime}\mid s,a)$. In addition, the planner computes the reward $r(\widehat{\tau}_{k})$ of each plan's trajectory according to the reward function $r$ that is provided for the test-time task. Following the MPC principle, the planner selects the best plan according to the reward function $r$ and executes its first action $a_{1}$. The planning algorithm repeats this procedure for the next state iteratively until it achieves its goal.
We use a similar strategy to design an MPC planner to exploit previously-learned skill-transition dynamics $q_{\varphi}(s^{\prime}\mid s,z)$. Note that unlike conventional model-based RL, we generate a plan $P_{k}=(z_{k,1},\dots,z_{k,H_{P}})$ in the latent space $\mathcal{Z}$ as opposed to the action space $\mathcal{A}$ that would be used by a standard planner. Since the primitives are temporally meaningful, it is beneficial to hold a primitive for a horizon $H_{Z}>1$, unlike actions which are usually held for a single step. Thus, effectively, the planning horizon for our latent space planner is $H=H_{P}\times H_{Z}$, enabling longer-horizon planning using fewer primitives. Similar to the standard MPC setting, the latent space planner simulates the trajectory $\widehat{\tau}_{k}=(s_{k,1},z_{k,1},a_{k,1},s_{k,2},z_{k,2},a_{k,2},\dots,s_{k,H+1})$ and computes the reward $r(\widehat{\tau}_{k})$. After a small number of trajectory samples, the planner selects the first latent action $z_{1}$ of the best plan, executes it for $H_{Z}$ steps in the environment, and then repeats the process until goal completion.
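A latent plan can be evaluated by rolling the skill-dynamics forward while holding each skill for $H_{Z}$ steps, as in the minimal sketch below. The function `toy_dynamics_mean` is a hypothetical placeholder for the mean prediction of the learned model, and rolling out the mean deterministically (rather than sampling next states) is a simplification made here.

```python
import numpy as np

def rollout_latent_plan(s0, plan, hold_steps, dynamics_mean):
    """Simulate states under a plan of H_P skills, holding each skill for H_Z steps."""
    states = [np.asarray(s0, dtype=float)]
    for z in plan:
        for _ in range(hold_steps):
            states.append(dynamics_mean(states[-1], z))
    return np.stack(states)                    # shape (H_P * H_Z + 1, dim_s)

def plan_return(states, reward_fn):
    """Sum of rewards over the simulated states after the initial one."""
    return float(sum(reward_fn(s) for s in states[1:]))

def toy_dynamics_mean(s, z):
    """Hypothetical model mean: each step moves the state by 0.1 * z."""
    return s + 0.1 * np.asarray(z, dtype=float)
```

For instance, a plan of two skills held for five steps each produces a simulated trajectory with $H+1=11$ states, so only two decision variables cover a ten-step horizon.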
The latent planner $P$ maintains a distribution of latent plans, each of length $H_{P}$. Each element in the sequence represents the distribution of the primitive to be executed at that time step. For continuous spaces, each element of the sequence can be modelled using a normal distribution, $\mathcal{N}(\mu_{1},\Sigma),\dots,\mathcal{N}(\mu_{H_{P}},\Sigma)$. We refine the planning distributions for $R$ steps, using $K$ samples of latent plans $P_{k}$, and compute the reward $r_{k}$ for each simulated trajectory $\widehat{\tau}_{k}$. The update for the parameters follows that of the Model Predictive Path Integral (MPPI) controller (Williams et al., 2016):
${\mu}_{i}={\displaystyle \sum _{k=1}^{K}}{\displaystyle \frac{\mathrm{exp}(\gamma {r}_{k})}{{\sum}_{p=1}^{K}\mathrm{exp}(\gamma {r}_{p})}}{z}_{k,i}\mathit{\hspace{1em}}\forall i=1,\mathrm{\dots}{H}_{P}$  (10) 
While we keep the covariance matrix of the distributions fixed, it is possible to update it as well, as shown in Williams et al. (2016). We show an overview of the planning algorithm in Figure 3, and provide more implementation details in Appendix A.
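The mean update of Eq. 10 is an exponentially weighted average of the sampled plans, which the sketch below computes for all plan steps at once. The array shapes and the subtraction of the maximum return before exponentiating (which leaves the weights unchanged but avoids overflow) are choices made for this illustration.

```python
import numpy as np

def mppi_update(plans, returns, gamma):
    """Eq. 10: average of sampled plans, weighted by softmax(gamma * r_k), per plan step."""
    plans = np.asarray(plans, dtype=float)       # shape (K, H_P, dim_z)
    returns = np.asarray(returns, dtype=float)   # shape (K,)
    # Shifting by the maximum return does not change the normalized weights.
    weights = np.exp(gamma * (returns - returns.max()))
    weights /= weights.sum()
    return np.einsum("k,kij->ij", weights, plans)  # new means, shape (H_P, dim_z)
```

A small $\gamma$ makes the update close to a plain average of the samples, while a large $\gamma$ concentrates the new means on the highest-reward plan.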
5 Related Work
Central to our method is the concept of skill discovery via mutual information maximization. This principle, proposed in prior work that utilized purely model-free unsupervised RL methods (Daniel et al., 2012; Florensa et al., 2017; Eysenbach et al., 2018; Gregor et al., 2016; Warde-Farley et al., 2018), aims to learn diverse skills via a discriminability objective: a good set of skills is one where it is easy to distinguish the skills from each other, which means they perform distinct tasks and cover the space of possible behaviors. Building on this prior work, we distinguish our skills based on how they modify the original uncontrolled dynamics of the system. This simultaneously encourages the skills to be both diverse and predictable. We also demonstrate that constraining the skills to be predictable makes them more amenable to hierarchical composition and thus more useful on downstream tasks.
Another line of work that is conceptually close to our method deals with intrinsic motivation (Oudeyer and Kaplan, 2009; Oudeyer et al., 2007; Schmidhuber, 2010), which is used to drive the agent's exploration. Examples of such works include empowerment (Klyubin et al., 2005; Mohamed and Rezende, 2015), count-based exploration (Bellemare et al., 2016; Oh et al., 2015; Tang et al., 2017; Fu et al., 2017), information gain about the agent's dynamics (Stadie et al., 2015) and forward-inverse dynamics models (Pathak et al., 2017). While our method uses an information-theoretic objective that is similar to these approaches, it is used to learn a variety of skills that can be directly used for model-based planning, which is in contrast to learning a better exploration policy for a single skill. We provide a discussion on the connection between empowerment and DADS in Appendix C.
The skills discovered using our approach can also provide extended actions and temporal abstraction, which enable more efficient exploration for the agent to solve various tasks, reminiscent of hierarchical RL (HRL) approaches. This ranges from the classic options framework (Sutton et al., 1999; Stolle and Precup, 2002; Perkins et al., 1999) to some of the more recent work (Bacon et al., 2017; Vezhnevets et al., 2017; Nachum et al., 2018; Hausman et al., 2018). However, in contrast to end-to-end HRL approaches (Heess et al., 2016; Peng et al., 2017), we can leverage a stable, two-phase learning setup. The primitives learned through our method provide action and temporal abstraction, while planning with skill-dynamics enables hierarchical composition of these primitives, bypassing many problems of end-to-end HRL.
In the second phase of our approach, we use the learned skill-transition dynamics models to perform model-based planning, an idea that has been explored numerous times in the literature. Model-based reinforcement learning has been traditionally approached with methods that are well-suited for low-data regimes such as Gaussian Processes (Rasmussen, 2003), showing significant data-efficiency gains over model-free approaches (Deisenroth et al., 2013; Kamthe and Deisenroth, 2017; Kocijan et al., 2004; Ko et al., 2007). More recently, due to the challenges of applying these methods to high-dimensional state spaces, MBRL approaches employ Bayesian deep neural networks (Nagabandi et al., 2018; Chua et al., 2018b; Gal et al., 2016; Fu et al., 2016; Lenz et al., 2015) to learn dynamics models. In our approach, we take advantage of deep dynamics models that are conditioned on the skill being executed, simplifying the modelling problem. In addition, the skills themselves are learned with the objective of being predictable, which further assists with the learning of the dynamics model. There have also been multiple approaches addressing the planning component of MBRL, including linear controllers for local models (Levine et al., 2016; Kumar et al., 2016; Chebotar et al., 2017), uncertainty-aware (Chua et al., 2018b; Gal et al., 2016) or deterministic planners (Nagabandi et al., 2018) and stochastic optimization methods (Williams et al., 2016). The main contribution of our work lies in discovering model-based skill primitives that can be further combined by a standard model-based planner. We therefore take advantage of an existing planning approach, Model Predictive Path Integral (Williams et al., 2016), that can leverage our pre-trained setting.
6 Experiments
Through our experiments, we aim to demonstrate that: (a) DADS as a general-purpose skill discovery algorithm can scale to high-dimensional problems; (b) discovered skills are amenable to hierarchical composition; and (c) not only is planning in the learned latent space feasible, but it is competitive with strong baselines. In Section 6.1, we provide visualizations and qualitative analysis of the skills learned using DADS. We demonstrate in Section 6.2 and Section 6.4 that optimizing the primitives for predictability renders skills more amenable to temporal composition that can be used for hierarchical RL. We benchmark against a state-of-the-art model-based RL baseline in Section 6.3, and against goal-conditioned RL in Section 6.5.
6.1 Qualitative Analysis
In this section, we provide a qualitative discussion of the unsupervised skills learned using DADS. We use the MuJoCo environments (Todorov et al., 2012) from the OpenAI gym as our test-bed (Brockman et al., 2016). We find that our proposed algorithm can learn diverse skills without any reward, even in problems with high-dimensional state and actuation, as illustrated in Figure 4. DADS can discover primitives for HalfCheetah to run forward and backward with multiple different gaits, for Ant to navigate the environment using diverse locomotion primitives and for Humanoid to walk using stable locomotion primitives with diverse gaits and direction. The videos of the discovered primitives are available at: https://sites.google.com/view/dadsskill
Qualitatively, we find the skills discovered by DADS to be predictable and stable, in line with implicit constraints of the proposed objective. While the HalfCheetah will learn to run in both backward and forward directions, DADS will disincentivize skills which make HalfCheetah flip owing to the reduced predictability on landing. Similarly, skills discovered for Ant rarely flip over, and tend to provide stable navigation primitives in the environment. This also incentivizes the Humanoid, which is characteristically prone to collapsing and extremely unstable by design, to discover gaits which are stable for sustainable locomotion.
One of the significant advantages of the proposed objective is that it is compatible with continuous skill spaces, which has not been shown in prior work on skill discovery (Eysenbach et al., 2018). Not only does this allow us to embed a large and diverse set of skills into a compact latent space, but the smoothness of the learned space also allows us to interpolate between behaviors generated in the environment. We demonstrate this on the Ant environment (Figure 5), where we learn a two-dimensional continuous skill space with a uniform prior over $(-1,1)$ in each dimension, and compare it to a discrete skill space with a uniform prior over 20 skills. Similar to Eysenbach et al. (2018), we restrict the observation space of the skill-dynamics $q$ to the Cartesian coordinates $(x,y)$. We hereby call this the x-y prior, and discuss its role in Section 6.2.
In Figure 5, we project the trajectories of the learned Ant skills from both discrete and continuous spaces onto the Cartesian plane. From the traces of the skills, it is clear that the continuous latent space can generate more diverse trajectories. We demonstrate in Section 6.3 that continuous primitives are more amenable to hierarchical composition and generally perform better on downstream tasks. More importantly, we observe that the learned skill space is semantically meaningful. The heatmap in Figure 5 shows the orientation of the trajectory (with respect to the x-axis) as a function of the skill $z\in\mathcal{Z}$, which varies smoothly as $z$ is varied, with explicit interpolations shown in Appendix D.
6.2 Skill Variance Analysis
In an unsupervised skill learning setup, it is important to optimize the primitives to be diverse. However, we argue that diversity is not sufficient for the learned primitives to be useful for downstream tasks. Primitives must exhibit low-variance behavior, which enables long-horizon composition of the learned skills in a hierarchical setup. We analyze the variance of the x-y trajectories in the environment, where we also benchmark the variance of the primitives learned by DIAYN (Eysenbach et al., 2018). For DIAYN, we use the x-y prior for the skill-discriminator, which biases the discovered skills to diversify in the x-y space. This step was necessary for that baseline to obtain a competitive set of navigation skills. Figure 6 (Left) demonstrates that DADS, which optimizes the primitives for predictability and diversity, yields significantly lower-variance primitives when compared to DIAYN, which only optimizes for diversity. This is starkly demonstrated in the plots of x-y traces of skills learned in different setups. Skills learned by DADS show significant control over the trajectories generated in the environment, while skills from DIAYN exhibit high variance in the environment, which limits their utility for hierarchical control. This is further demonstrated quantitatively in Section 6.4.
While optimizing for predictability already significantly reduces the variance of the trajectories generated by a primitive, we find that using the x-y prior with DADS brings down the skill variance even further. For quantitative benchmarks in the next sections, we assume that the Ant skills are learned using an x-y prior on the observation space, for both DADS and DIAYN.
6.3 Model-Based Reinforcement Learning
The key utility of learning a parametric model $q_{\varphi}(s^{\prime}\mid s,z)$ is to enable the use of planning algorithms for downstream tasks, which can be extremely sample-efficient. In our setup, we can solve test-time tasks zero-shot, that is, without any learning on the downstream task. We compare with the state-of-the-art model-based RL method (Chua et al., 2018a), which learns a dynamics model parameterized as $p(s^{\prime}\mid s,a)$, on the task of the Ant navigating to a specified goal with a dense reward. Given a goal $g$, the reward at any position $u$ is given by $r(u)=-{\parallel g-u\parallel}_{2}$. We benchmark our method against the following variants:

• Random-MBRL (rMBRL): We train the model $p(s^{\prime}\mid s,a)$ on randomly collected trajectories, and test the zero-shot generalization of the model on a distribution of goals.
• Weak-oracle MBRL (WO-MBRL): We train the model $p(s^{\prime}\mid s,a)$ on trajectories generated by the planner to navigate to a goal, randomly sampled in every episode. The distribution of goals during training matches the distribution at test time.
• Strong-oracle MBRL (SO-MBRL): We train the model $p(s^{\prime}\mid s,a)$ on trajectories generated by the planner to navigate to a specific goal, which is fixed for both training and test time.
Amongst the variants, only rMBRL matches our assumption of unsupervised, task-agnostic training. Both WO-MBRL and SO-MBRL benefit from goal-directed exploration during training, a significant advantage over DADS, which only uses mutual-information-based exploration.
We use $\Delta=\sum_{t=1}^{H}\frac{-r(u)}{H{\parallel g\parallel}_{2}}$ as the metric, which represents the distance to the goal $g$ averaged over the episode (with the same fixed horizon $H$ for all models and experiments), normalized by the initial distance to the goal $g$. Therefore, lower $\Delta$ indicates better performance and $0<\Delta\le 1$ (assuming the agent goes closer to the goal). The test set of goals is fixed for all the methods, sampled from ${[-15,15]}^{2}$.
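The metric can be computed from a logged sequence of agent positions as in the short sketch below. It assumes the agent starts at the origin, so that the initial distance to the goal equals the norm of $g$; the function name and the array layout are illustrative choices.

```python
import numpy as np

def normalized_distance(positions, goal):
    """Average distance to the goal over the episode, divided by the initial distance.

    Assumes the agent starts at the origin, so the initial distance is ||g||_2.
    """
    positions = np.asarray(positions, dtype=float)   # shape (H, 2), one row per step
    goal = np.asarray(goal, dtype=float)
    distances = np.linalg.norm(goal - positions, axis=1)
    return float(distances.mean() / np.linalg.norm(goal))
```

An agent that never leaves the origin scores exactly 1, and an agent that walks straight to the goal in equal steps scores well below 1.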
Figure 7 demonstrates that zero-shot planning significantly outperforms all model-based RL baselines, despite the advantage of the baselines being trained on the test goal(s). For the experiment depicted in Figure 7 (Right), DADS has an unsupervised pre-training phase, unlike SO-MBRL, which is trained directly for the task. A comparison with Random-MBRL shows the significance of mutual-information-based exploration, especially with the right parameterization and priors. This experiment also demonstrates the advantage of learning a continuous space of primitives, which outperforms planning on discrete primitives.
6.4 Hierarchical Control with Unsupervised Primitives
We benchmark hierarchical control for primitives learned without supervision against our proposed scheme using an MPPI-based planner on top of DADS-learned skills. We persist with the task of Ant-navigation as described in Section 6.3. We benchmark against Hierarchical DIAYN (Eysenbach et al., 2018), which learns the skills using the DIAYN objective, freezes the low-level policy and learns a meta-controller that outputs the skill to be executed for the next $H_{Z}$ steps. We provide the x-y prior to DIAYN's discriminator while learning the skills for the Ant agent. The performance of the meta-controller is constrained by the low-level policy; however, this hierarchical scheme is agnostic to the algorithm used to learn the low-level policy. To contrast the quality of primitives learned by DADS and DIAYN, we also benchmark against Hierarchical DADS, which learns a meta-controller the same way as Hierarchical DIAYN, but learns the skills using DADS.
From Figure 8 (Left) We find that the metacontroller is unable to compose the skills learned by DIAYN, while the same metacontroller can learn to compose skills by DADS to navigate the Ant to different goals. This result seems to confirm our intuition described in Section 6.2, that the high variance of the DIAYN skills limits their temporal compositionality. Interestingly, learning a RL metacontroller reaches similar performance to the MPPI controller, taking an additional $200,000$ samples per goal.
6.5 Goal-conditioned RL
To demonstrate the benefits of our approach over model-free RL, we benchmark against goal-conditioned RL on two versions of Ant-navigation: (a) with a dense reward $r(u)$ and (b) with a sparse reward $r(u)=1$ if ${\parallel u-g\parallel}_{2}\le \epsilon$, else 0. We train the goal-conditioned RL agent using soft actor-critic, where the state variable of the agent is augmented with $u-g$, the position delta to the goal. The agent gets a randomly sampled goal from ${[-10,10]}^{2}$ at the beginning of the episode.
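The two reward variants and the goal-augmented state can be sketched as follows. This is a minimal illustration, assuming that the dense reward is the negative Euclidean distance to the goal; the function names and the list-based state representation are hypothetical.

```python
import math

def dense_reward(u, goal):
    # Assumed form of the dense reward: negative Euclidean distance to the goal.
    return -math.dist(u, goal)

def sparse_reward(u, goal, epsilon):
    # 1 when the agent is within epsilon of the goal, 0 otherwise.
    return 1.0 if math.dist(u, goal) <= epsilon else 0.0

def augment_state(state, u, goal):
    # Append the position delta u - g to the agent's state vector.
    return list(state) + [u[0] - goal[0], u[1] - goal[1]]
```

The sparse variant gives no learning signal until the agent happens to enter the $\epsilon$-ball around the goal, which is what makes it difficult for a model-free learner.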
In Figure 8 (Right), we measure the average performance of all the methods as a function of the initial distance to the goal, ranging from 5 to 30 metres. For dense-reward navigation, we observe that while model-based planning on DADS-learned skills degrades smoothly as the initial distance to the goal increases, goal-conditioned RL experiences a sudden deterioration outside the goal distribution it was trained on. Even within the goal distribution observed during training of the goal-conditioned RL model, skill-space planning performs competitively with it. With sparse-reward navigation, goal-conditioned RL is unable to navigate, while MPPI demonstrates performance comparable to the dense-reward setting up to about 20 metres. This highlights the utility of learning task-agnostic skills, which makes them more general, while showing that latent-space planning can be leveraged for tasks requiring long-horizon planning.
7 Conclusion
We have proposed a novel unsupervised skill learning algorithm that is amenable to model-based planning for hierarchical control on downstream tasks. We show that our skill learning method can scale to high-dimensional state-spaces, while discovering a diverse set of low-variance skills. In addition, we demonstrated that, without any training on the specified task, we can compose the learned skills to outperform competitive model-based baselines that were trained with the knowledge of the test tasks. We plan to extend the algorithm to work with off-policy data, potentially using relabelling tricks (Andrychowicz et al., 2017; Nachum et al., 2018), and to explore more nuanced planning algorithms. We also plan to apply the method introduced here to different domains, such as manipulation, and to enable skill/model discovery directly from images, culminating in unsupervised skill discovery on robotic setups.
8 Acknowledgements
We would like to thank Evan Liu, Ben Eysenbach, and Anusha Nagabandi for their help in reproducing the baselines for this work. We are thankful to Ben Eysenbach for their comments and discussion on the initial drafts. We would also like to acknowledge Ofir Nachum, Alex Alemi, Daniel Freeman, Yiding Jiang, Allan Zhou and other colleagues at Google Brain for their helpful feedback and discussions at various stages of this work. We are also thankful to Michael Ahn and others in the Adept team for their support, especially with the infrastructure setup and scaling up the experiments.
References
 Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Achiam et al. (2018) J. Achiam, H. Edwards, D. Amodei, and P. Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.
 Agakov (2004) D. Barber and F. Agakov. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201, 2004.
 Alemi and Fischer (2018) A. A. Alemi and I. Fischer. TherML: Thermodynamics of machine learning. arXiv preprint arXiv:1807.04162, 2018.
 Alemi et al. (2016) A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
 Andrychowicz et al. (2017) M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017. URL http://arxiv.org/abs/1707.01495.
 Bacon et al. (2017) P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Bellemare et al. (2016) M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
 Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.
 Chebotar et al. (2017) Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 703–711. JMLR.org, 2017.
 Chua et al. (2018a) K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. CoRR, abs/1805.12114, 2018a. URL http://arxiv.org/abs/1805.12114.
 Chua et al. (2018b) K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4759–4770, 2018b.
 Csiszár and Matus (2003) I. Csiszár and F. Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
 Daniel et al. (2012) C. Daniel, G. Neumann, and J. Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012.
 Deisenroth and Rasmussen (2011) M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
 Deisenroth et al. (2013) M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2013.
 Eysenbach et al. (2018) B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 Florensa et al. (2017) C. Florensa, Y. Duan, and P. Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
 Friedman et al. (2001) N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 152–161. Morgan Kaufmann Publishers Inc., 2001.
 Fu et al. (2016) J. Fu, S. Levine, and P. Abbeel. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4019–4026. IEEE, 2016.
 Fu et al. (2017) J. Fu, J. Co-Reyes, and S. Levine. Ex2: Exploration with exemplar models for deep reinforcement learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2577–2587. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6851-ex2-exploration-with-exemplar-models-for-deep-reinforcement-learning.pdf.
 Gal et al. (2016) Y. Gal, R. McAllister, and C. E. Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, 2016.
 Garcia et al. (1989) C. E. Garcia, D. M. Prett, and M. Morari. Model predictive control: theory and practice—a survey. Automatica, 25(3):335–348, 1989.
 Gregor et al. (2016) K. Gregor, D. J. Rezende, and D. Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
 Gu et al. (2017) S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
 Ha and Schmidhuber (2018) D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2455–2467, 2018.
 Haarnoja et al. (2018a) T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018a.
 Haarnoja et al. (2018b) T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018b. URL http://arxiv.org/abs/1812.05905.
 Hausman et al. (2018) K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rk07ZXZRb.
 Heess et al. (2016) N. Heess, G. Wayne, Y. Tassa, T. Lillicrap, M. Riedmiller, and D. Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
 Heess et al. (2017) N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, M. Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Houthooft et al. (2016) R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel. Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. CoRR, abs/1605.09674, 2016. URL http://arxiv.org/abs/1605.09674.
 Jacobs et al. (1991) R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, et al. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
 Kalashnikov et al. (2018) D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
 Kamthe and Deisenroth (2017) S. Kamthe and M. P. Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. arXiv preprint arXiv:1706.06491, 2017.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Klyubin et al. (2005) A. S. Klyubin, D. Polani, and C. L. Nehaniv. Empowerment: A universal agent-centric measure of control. In 2005 IEEE Congress on Evolutionary Computation, volume 1, pages 128–135. IEEE, 2005.
 Ko et al. (2007) J. Ko, D. J. Klein, D. Fox, and D. Haehnel. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Proceedings 2007 ieee international conference on robotics and automation, pages 742–747. IEEE, 2007.
 Kocijan et al. (2004) J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian process model based predictive control. In Proceedings of the 2004 American Control Conference, volume 3, pages 2214–2219. IEEE, 2004.
 Kumar et al. (2016) V. Kumar, E. Todorov, and S. Levine. Optimal control with learned local models: Application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 378–383. IEEE, 2016.
 Kurutach et al. (2018) T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
 Lenz et al. (2015) I. Lenz, R. A. Knepper, and A. Saxena. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems. Rome, Italy, 2015.
 Levine (2018) S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Levine et al. (2016) S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Li and Todorov (2004) W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.
 Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mohamed and Rezende (2015) S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 2125–2133, 2015.
 Nachum et al. (2018) O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3307–3317, 2018.
 Nagabandi et al. (2018) A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 Oh et al. (2015) J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
 Oudeyer and Kaplan (2009) P.-Y. Oudeyer and F. Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.
 Oudeyer et al. (2007) P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
 Pathak et al. (2017) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
 Peng et al. (2017) X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4):41, 2017.
 Perkins et al. (1999) T. J. Perkins, D. Precup, et al. Using options for knowledge transfer in reinforcement learning. University of Massachusetts, Amherst, MA, USA, Tech. Rep, 1999.
 Rajeswaran et al. (2017) A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
 Rasmussen (2003) C. E. Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
 Schmidhuber (2010) J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
 Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In Icml, volume 37, pages 1889–1897, 2015.
 Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Guadarrama et al. (2018) S. Guadarrama, A. Korattikara, O. Ramirez, P. Castro, E. Holly, S. Fishman, K. Wang, E. Gonina, C. Harris, V. Vanhoucke, and E. Brevdo. TF-Agents: A library for reinforcement learning in TensorFlow, 2018. URL https://github.com/tensorflow/agents. [Online; accessed 30-November-2018].
 Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
 Slonim et al. (2005) N. Slonim, G. S. Atwal, G. Tkacik, and W. Bialek. Estimating mutual information and multi–information in large networks. arXiv preprint cs/0502017, 2005.
 Stadie et al. (2015) B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015. URL http://arxiv.org/abs/1507.00814.
 Stolle and Precup (2002) M. Stolle and D. Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pages 212–223. Springer, 2002.
 Sutton et al. (1999) R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 Tang et al. (2017) H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
 Tishby et al. (2000) N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 Todorov et al. (2012) E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
 Vezhnevets et al. (2017) A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3540–3549. JMLR.org, 2017.
 Warde-Farley et al. (2018) D. Warde-Farley, T. Van de Wiele, T. Kulkarni, C. Ionescu, S. Hansen, and V. Mnih. Unsupervised control through non-parametric discriminative rewards. arXiv preprint arXiv:1811.11359, 2018.
 Watter et al. (2015) M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.
 Williams et al. (2016) G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433–1440. IEEE, 2016.
 Ziebart et al. (2008) B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Implementation Details
All of our models are written in the open-source TensorFlow-Agents library [Guadarrama et al., 2018], based on TensorFlow [Abadi et al., 2015].
A.1 Skill Spaces
When using discrete spaces, we parameterize $\mathcal{Z}$ as one-hot vectors. These one-hot vectors are randomly sampled from the uniform prior $p(z)=\frac{1}{D}$, where $D$ is the number of skills, usually between 20 and 128. For continuous spaces, we sample $z\sim \text{Uniform}{(-1,1)}^{D}$. We generally vary $D$ from 2 (Ant learnt with x-y prior) to 5 (Humanoid on full observation spaces). The skills are sampled once at the beginning of the episode and fixed for the rest of the episode. However, it is possible to resample the skill from the prior within the episode, which allows every skill to experience a distribution different from the initialization distribution and encourages skills that are temporally compositional. The resampling frequency should be low, at most once or twice per episode, so that every skill has sufficient time to act.
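The sampling scheme described above can be sketched as follows. This is an illustrative example only: the helper names and the `resample_every` argument are hypothetical, and the random generator is seeded purely for reproducibility.

```python
import random

def sample_discrete_skill(num_skills, rng):
    """One-hot skill drawn from the uniform prior p(z) = 1/D."""
    index = rng.randrange(num_skills)
    return [1.0 if i == index else 0.0 for i in range(num_skills)]

def sample_continuous_skill(dim, rng):
    """Skill drawn from Uniform(-1, 1)^D."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def skills_for_episode(horizon, sampler, resample_every=None):
    """Skill used at each step; resample_every=None keeps one skill per episode."""
    skills = []
    z = sampler()
    for t in range(horizon):
        if resample_every is not None and t > 0 and t % resample_every == 0:
            z = sampler()  # draw a fresh skill from the prior mid-episode
        skills.append(z)
    return skills

rng = random.Random(0)
z_discrete = sample_discrete_skill(20, rng)
z_continuous = sample_continuous_skill(2, rng)
# One resampling in a 200-step episode, in line with "once or twice per episode".
episode_skills = skills_for_episode(
    200, lambda: sample_continuous_skill(2, rng), resample_every=100
)
```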
A.2 Agent
We use SAC as the optimizer for our agent $\pi (a\mid s,z)$, in particular, EC-SAC [Haarnoja et al., 2018b]. The $s$ input to the policy generally excludes the global coordinates (x, y) of the centre-of-mass, available for many environments in OpenAI Gym, which helps produce skills agnostic to the location of the agent. We restrict ourselves to two hidden layers for our policy and critic networks. However, to improve the expressivity of skills, it is beneficial to increase the capacity of the networks. The hidden layer sizes can vary from (128, 128) for HalfCheetah to (1024, 1024) for Humanoid. The critic $Q(s,a,z)$ is similarly parameterized. The target function for the critic $Q$ is updated every iteration using a soft update with a coefficient of $0.005$. We use the Adam [Kingma and Ba, 2014] optimizer with a fixed learning rate of $3\times {10}^{-4}$ and a fixed entropy coefficient $\beta =0.1$. While the policy is parameterized as a normal distribution $\mathcal{N}(\mu (s,z),\mathrm{\Sigma}(s,z))$, where $\mathrm{\Sigma}$ is a diagonal covariance matrix, its output passes through a tanh transformation to map it to the range $(-1,1)$ and constrain it to the action bounds.
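Two of the mechanics mentioned above, the tanh squashing of Gaussian actions and the soft target update, can be sketched in a few lines. This is a simplified illustration rather than the TF-Agents implementation: parameters are plain lists of floats, and the linear rescaling from $(-1,1)$ to the action bounds is an assumption of the sketch.

```python
import math
import random

def sample_squashed_action(mean, std, low, high, rng):
    """Sample from a diagonal Gaussian, squash with tanh, rescale to [low, high]."""
    action = []
    for m, s, lo, hi in zip(mean, std, low, high):
        raw = rng.gauss(m, s)
        squashed = math.tanh(raw)  # lies in (-1, 1), up to float rounding
        action.append(lo + 0.5 * (squashed + 1.0) * (hi - lo))
    return action

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

With a coefficient of 0.005, the target critic moves only half a percent of the way towards the online critic at each iteration.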
A.3 Skill-Dynamics
Skill-dynamics, denoted by $q({s}^{\prime}\mid s,z)$, is parameterized by a deep neural network. A common trick in model-based RL is to predict $\mathrm{\Delta}s={s}^{\prime}-s$, rather than the full state ${s}^{\prime}$. Hence, the prediction network is $q(\mathrm{\Delta}s\mid s,z)$. Note that both parameterizations can represent the same set of functions. However, the latter is easier to learn, as $\mathrm{\Delta}s$ is centred around 0. While the global coordinates are excluded from the input to $q$, it is useful to predict ${\mathrm{\Delta}}_{x},{\mathrm{\Delta}}_{y}$, because reward functions for goal-based navigation generally rely on the position prediction from the model. The skill-dynamics has the same capacity as the agent/critic, with the same hidden layer sizes. The output distribution is modelled as a Mixture-of-Experts [Jacobs et al., 1991], where each expert is a diagonal state-dependent Gaussian and every expert has a weight that depends on the input. The number of experts is 4. Batch-normalization was found to be useful for learning skill-dynamics. However, it is important to turn off the learnable parameters for the last layer for sanity of the learning process.
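To make the output distribution concrete, the sketch below evaluates the log-density of a mixture of diagonal Gaussians over $\mathrm{\Delta}s$. The arguments `logits`, `means` and `stds` are hypothetical stand-ins for the outputs of the network evaluated at $(s,z)$; the network itself is not modelled here.

```python
import math

def diag_gaussian_log_prob(x, mean, std):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * ((xi - mi) / si) ** 2 - math.log(si) - 0.5 * math.log(2.0 * math.pi)
        for xi, mi, si in zip(x, mean, std)
    )

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def mixture_log_prob(delta_s, logits, means, stds):
    """log q(delta_s | s, z) for a mixture of diagonal Gaussian experts.

    logits[k] is the unnormalized log-weight of expert k; means[k] and stds[k]
    are its per-dimension mean and standard deviation.
    """
    log_norm = log_sum_exp(logits)
    terms = [
        (logit - log_norm) + diag_gaussian_log_prob(delta_s, mean, std)
        for logit, mean, std in zip(logits, means, stds)
    ]
    return log_sum_exp(terms)
```

Maximizing this quantity on observed transitions is a maximum-likelihood fit; working in log-space with a log-sum-exp keeps the computation numerically stable when an expert assigns a very small density.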
A.4 Other Hyperparameters
The episode horizon is generally kept shorter for stable agents like Ant (usually 200), and longer for unstable agents like Humanoid (500-1000). For Ant, longer episodes do not add value, but Humanoid can benefit from longer episodes, as they help filter out skills which are unstable. The optimization scheme is on-policy, and generally about 1000-4000 steps are collected in one iteration. The idea is to get episodes of about 5-10 skills in a batch. Resampling skills within episodes can be useful if working with longer episodes. Once a batch of episodes is collected, the skill-dynamics is updated using the Adam optimizer with a fixed learning rate of 3e-4. The batch size is 128, and generally 20-50 steps of gradient descent are carried out. To compute the intrinsic reward, we need to resample the prior for computing the denominator. For continuous spaces, we set the number of prior samples $L$ between 50 and 500. For discrete spaces, we can marginalize over all skills. After the intrinsic reward is computed, the policy and critic networks are updated for 64-128 steps with a batch size of 128. This is to ensure that every sample in the batch is seen about 3-4 times, in expectation.
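The intrinsic reward computation with prior resampling can be sketched as below, using the form $\mathrm{log}\,q({s}^{\prime}\mid s,z)-\mathrm{log}\frac{1}{L}{\sum}_{i}q({s}^{\prime}\mid s,{z}_{i})$ that appears in Appendix C. The one-dimensional `toy_log_q` is a hypothetical stand-in for the learned skill-dynamics, used only to make the example runnable.

```python
import math
import random

def intrinsic_reward(log_q, s, z, s_next, prior_samples):
    """log q(s'|s,z) - log((1/L) * sum_i q(s'|s,z_i)), with z_i drawn from the prior."""
    numerator = log_q(s_next, s, z)
    alternatives = [log_q(s_next, s, z_i) for z_i in prior_samples]
    top = max(alternatives)
    log_sum = top + math.log(sum(math.exp(a - top) for a in alternatives))
    return numerator - log_sum + math.log(len(prior_samples))

def toy_log_q(s_next, s, z):
    # Hypothetical skill-dynamics: skill z shifts a 1-D state by z, with
    # unit-variance Gaussian noise.
    return -0.5 * (s_next - (s + z)) ** 2 - 0.5 * math.log(2.0 * math.pi)

rng = random.Random(0)
prior = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # L = 100 samples from p(z)
r_predictable = intrinsic_reward(toy_log_q, 0.0, 1.0, 1.0, prior)
r_mismatch = intrinsic_reward(toy_log_q, 0.0, 1.0, -1.0, prior)
```

In this toy setting, a transition that the executed skill explains better than the other sampled skills receives a positive reward, and a transition that other skills explain better receives a negative one.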
A.5 Planning and Evaluation Setups
For evaluation, we fix the episode horizon to 200 for all models in all evaluation setups. Depending upon the size of the latent space and planning horizon, the number of samples from the planning distribution $P$ is varied between 10 and 200. The coefficient $\gamma $ for MPPI is set to 10. We generally found that setting ${H}_{P}=1$ and ${H}_{Z}=10$ worked well, in which case we set the number of refine steps $R=10$. However, for sparse-reward navigation it is important to have a longer planning horizon, in which case we set ${H}_{P}=4,{H}_{Z}=25$ with a higher number of samples from the planning distribution. Also, when using longer planning horizons, we found that smoothing the sampled plans helps. Thus, if the sampled plan is ${z}_{1},{z}_{2},{z}_{3},{z}_{4}\mathrm{\dots}$, we smooth the plan to make ${z}_{2}=\beta {z}_{1}+(1-\beta ){z}_{2}$ and so on. The $\beta $ is generally kept high, between 0.8 and 0.95.
For hierarchical controllers learning on top of low-level unsupervised primitives, we use PPO [Schulman et al., 2017] for discrete action skills, while we use SAC for continuous skills. We keep the number of steps after which the meta-action is decided as 10 (that is, ${H}_{Z}=10$). The hidden layer sizes of the meta-controller are (128, 128). We use a learning rate of 1e-4 for PPO and 3e-4 for SAC.
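The plan smoothing and the role of the coefficient $\gamma $ can be illustrated with the sketch below. Two assumptions are made here: the smoothing is applied sequentially, so that each skill is blended with the already-smoothed previous skill, and $\gamma $ enters through standard MPPI-style exponential weights over the returns of the sampled plans. Both function names are hypothetical.

```python
import math

def smooth_plan(plan, beta):
    """Sequentially smooth a plan of skills: z_k <- beta * z_{k-1} + (1 - beta) * z_k."""
    smoothed = [list(plan[0])]
    for z in plan[1:]:
        prev = smoothed[-1]
        smoothed.append([beta * p + (1.0 - beta) * c for p, c in zip(prev, z)])
    return smoothed

def mppi_weights(returns, gamma=10.0):
    """Weights exp(gamma * R_k) / sum_j exp(gamma * R_j) over the sampled plans."""
    top = max(returns)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(gamma * (r - top)) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

A high $\beta $ makes consecutive skills in a plan change slowly, which matches the intent of keeping long-horizon plans smooth.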
Appendix B Graphical models, Information Bottleneck and Unsupervised Skill Learning
We now present a novel perspective on unsupervised skill learning, motivated by the literature on the Information Bottleneck. This section takes inspiration from [Alemi and Fischer, 2018], which helps us provide a rigorous justification for the objective proposed earlier. To obtain our unsupervised RL objective, we set up a graphical model $P$ as shown in Figure 10, which represents the distribution of trajectories generated by a given policy $\pi $. The joint distribution is given by:
$p({s}_{1},{a}_{1}\mathrm{\dots}{a}_{T-1},{s}_{T},z)=p(z)p({s}_{1}){\displaystyle \prod _{t=1}^{T-1}}\pi ({a}_{t}\mid {s}_{t},z)p({s}_{t+1}\mid {s}_{t},{a}_{t}).$  (11) 
We set up another graphical model $Q$, which represents the desired model of the world. In particular, we are interested in approximating $p({s}^{\prime}\mid s,z)$, which represents the transition function for a particular primitive. This abstraction lets us avoid reasoning about the exact actions, enabling model-based planning in behavior space (as discussed in the main paper). The joint distribution for $Q$ shown in Figure 10 is given by:
$q({s}_{1},\mathrm{\dots}{s}_{T},z)=p(z)p({s}_{1}){\displaystyle \prod _{t=1}^{T-1}}q({s}_{t+1}\mid {s}_{t},z).$  (12) 
The goal of our approach is to optimize the distribution $\pi (a\mid s,z)$ in the graphical model $P$ to minimize the distance between the two distributions, when transforming to the representation of the graphical model $Q$. In particular, we are interested in minimizing the KL divergence between $p$ and $q$, ${\mathcal{D}}_{KL}(p\parallel q)$. However, since $q$ is not known a priori, we set up the objective as ${\mathrm{min}}_{q\in Q}{\mathcal{D}}_{KL}(p\parallel q)$, which is the reverse information projection [Csiszár and Matus, 2003]. An alternate way to understand the objective is that we optimize the distribution $P$ to optimally project onto the graphical model $Q$. Note, if $Q$ had the same structure as $P$, the information lost in projection would be $0$ for any valid $P$. Interestingly, it was shown in Friedman et al. [2001] that:
$\underset{q}{\mathrm{min}}{\mathcal{D}}_{KL}(p\parallel q)={\mathcal{I}}_{P}-{\mathcal{I}}_{Q},$  (13) 
where ${\mathcal{I}}_{P}$ and ${\mathcal{I}}_{Q}$ represent the multi-information for distribution $P$ on the respective graphical models. The multi-information [Slonim et al., 2005] for a graphical model $G$ with nodes ${g}_{i}$ is defined as:
${\mathcal{I}}_{G}={\displaystyle \sum _{i}}I({g}_{i};Pa({g}_{i})),$  (14) 
where $Pa({g}_{i})$ denotes the nodes upon which ${g}_{i}$ has conditional dependence in $G$. Using this definition, we can compute the multi-information terms:
${\mathcal{I}}_{P}={\displaystyle \sum _{t=1}^{T}}I({a}_{t};\{{s}_{t},z\})+{\displaystyle \sum _{t=2}^{T}}I({s}_{t};\{{s}_{t-1},{a}_{t-1}\})\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{\mathcal{I}}_{Q}={\displaystyle \sum _{t=2}^{T}}I({s}_{t};\{{s}_{t-1},z\}).$  (15) 
Here, $I({s}_{t};\{{s}_{t-1},{a}_{t-1}\})$ is constant as we assume the underlying dynamics to be fixed (and unknown), and we can safely ignore this term. The final objective to be maximized is given by:
$R(\pi )={\displaystyle \sum _{t=1}^{T-1}}\left[I({s}_{t+1};\{{s}_{t},z\})-I({a}_{t};\{{s}_{t},z\})\right]$  (16)  
$={\displaystyle \sum _{t=1}^{T-1}}\left[\mathscr{H}({s}_{t+1})-\mathscr{H}({s}_{t+1}\mid {s}_{t},z)-I({a}_{t};\{{s}_{t},z\})\right]$  (17)  
$\ge {\displaystyle \sum _{t=1}^{T-1}}\left[I({s}_{t+1};z\mid {s}_{t})-I({a}_{t};\{{s}_{t},z\})\right]$  (18) 
Here, we have used the non-negativity of mutual information, that is, $I({s}^{\prime};s)\ge 0\Longrightarrow \mathscr{H}({s}^{\prime})\ge \mathscr{H}({s}^{\prime}\mid s)$. This yields the objective that we proposed to begin with: an unsupervised skill learning objective that explicitly fits a model for transition behaviors, while providing a grounded connection with probabilistic graphical models. Note that, unlike the setup of control as inference [Levine, 2018, Ziebart et al., 2008], which casts policy learning as variational inference, the policy here is assumed to be part of the generative model itself (hence the difference in the direction of ${\mathcal{D}}_{KL}$).
We can carry out the same exercise for the reward function in Diversity is All You Need (DIAYN) [Eysenbach et al., 2018] to provide a graphical model interpretation of the objective used in that paper. To conform with the objective in that paper, we assume that we sample state-action pairs from skill-conditioned stationary distributions in the world $P$, rather than trajectories. Again, the objective to be maximized is given by
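For completeness, the step from (17) to (18) can be spelled out per time step as follows. This is a restatement of the argument above using the same symbols, not an additional result.

```latex
\begin{aligned}
\mathscr{H}(s_{t+1}) - \mathscr{H}(s_{t+1}\mid s_t, z)
  &\ge \mathscr{H}(s_{t+1}\mid s_t) - \mathscr{H}(s_{t+1}\mid s_t, z)
  && \text{since } I(s_{t+1}; s_t) \ge 0 \\
  &= I(s_{t+1}; z \mid s_t)
  && \text{by definition of conditional mutual information.}
\end{aligned}
```

The action term $I({a}_{t};\{{s}_{t},z\})$ is untouched by this step, which is why it appears unchanged in both (17) and (18).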
$R(\pi )=-{\mathcal{I}}_{P}+{\mathcal{I}}_{Q}$  (19)  
$=-I(a;\{s,z\})+I(z;s)$  (20)  
$={\mathbb{E}}_{\pi}\left[\mathrm{log}{\displaystyle \frac{p(z\mid s)}{p(z)}}-\mathrm{log}{\displaystyle \frac{\pi (a\mid s,z)}{\pi (a)}}\right]$  (21)  
$\ge {\mathbb{E}}_{\pi}\left[\mathrm{log}{q}_{\varphi}(z\mid s)-\mathrm{log}p(z)-\mathrm{log}\pi (a\mid s,z)\right]=R(\pi ,{q}_{\varphi})$  (22) 
where we have used variational inequalities to replace $p(z\mid s)$ with ${q}_{\varphi}(z\mid s)$ and $\pi (a)$ with a uniform prior over bounded actions $p(a)$ (which is ignored as a constant).
Appendix C Interpretation as Empowerment in the Latent Space
Recall that the empowerment objective [Mohamed and Rezende, 2015] can be stated as
$I({s}^{\prime};a\mid s)=\mathscr{H}(a\mid s)-\mathscr{H}(a\mid {s}^{\prime},s)\ge \mathscr{H}(a\mid s)+{\mathbb{E}}_{p(s)\pi (a\mid s)p({s}^{\prime}\mid s,a)}[\mathrm{log}{q}_{\varphi}(a\mid {s}^{\prime},s)]$  (23) 
where we are learning a flat policy $\pi (a\mid s)$ and using the variational approximation ${q}_{\varphi}(a\mid {s}^{\prime},s)$ for the true action-posterior $p(a\mid {s}^{\prime},s)$. We can connect our objective with empowerment if we assume a latent-conditioned policy $\pi (a\mid s,z)$ and optimize $I({s}^{\prime};z\mid s)$, which can be interpreted as empowerment in the latent space $z$. There are two ways to decompose this objective:
$I({s}^{\prime};z\mid s)=\mathscr{H}(z\mid s)-\mathscr{H}(z\mid s,{s}^{\prime})$  (24)  
$=\mathscr{H}({s}^{\prime}\mid s)-\mathscr{H}({s}^{\prime}\mid s,z)$  (25) 
Using the first decomposition, we can construct an objective using a variational lower bound which learns the network ${q}_{\varphi}(z\mid {s}^{\prime},s)$. This is an inference network, which learns to discriminate skills based on the transitions they generate in the environment and not the state-distribution induced by each skill. However, we are interested in learning the network ${q}_{\varphi}({s}^{\prime}\mid s,z)$, which is why we work with the second decomposition. But, again, we are stuck with the marginal transition entropy, which is intractable to compute. We can handle it in a couple of ways:
$I(s'; z \mid s) \ge \mathbb{E}_{s}\mathbb{E}_{z}\mathbb{E}_{p(s' \mid s, z)}\left[\log \frac{q_{\varphi}(s' \mid s, z)}{p(s' \mid s)}\right]$  (26)  
$\quad \approx \mathbb{E}_{s}\mathbb{E}_{z}\mathbb{E}_{p(s' \mid s, z)}\left[\log \frac{q_{\varphi}(s' \mid s, z)}{\sum_{i=1}^{L} q_{\varphi}(s' \mid s, z_{i})} + \log L\right]$  (27) 
where $p(s' \mid s)$ represents the distribution of transitions from the state $s$. Note that we are using the approximation $p(s' \mid s) = \int p(s' \mid s, z)p(z)\,dz \approx \frac{1}{L}\sum_{i=1}^{L} q_{\varphi}(s' \mid s, z_{i})$, where the $z_{i}$ are sampled from the prior $p(z)$. Our use of $q(s' \mid s, z)$ encodes the intuition that $q$ should represent the distribution of transitions from $s$ under different primitives, and thus the marginal of $q$ over $z$ should approximately represent $p(s' \mid s)$. This procedure does not yield entropy-regularized RL by itself, but arguments similar to those provided for the Information Maximization algorithm by Mohamed and Rezende [2015] can be made here to justify it from the empowerment perspective.
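The term inside the expectation of Eq. (27) can be computed per transition as sketched below. This is a minimal illustration, not the authors' implementation: `log_q` is a hypothetical stand-in for the learned skill-dynamics network (here a one-dimensional Gaussian centered at $s + z$ with fixed standard deviation), and the prior $p(z)$ is assumed uniform on $[-1, 1]$.

```python
import math
import random

def log_q(next_state, state, skill, std=0.1):
    """Log-density of a hypothetical skill-dynamics model q(s'|s,z) = N(s + z, std^2)."""
    mean = state + skill
    return (-0.5 * ((next_state - mean) / std) ** 2
            - math.log(std * math.sqrt(2.0 * math.pi)))

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def intrinsic_reward(next_state, state, skill, prior_samples):
    """Per-transition term of Eq. (27): log q(s'|s,z) - log sum_i q(s'|s,z_i) + log L."""
    log_num = log_q(next_state, state, skill)
    log_den = logsumexp([log_q(next_state, state, z_i) for z_i in prior_samples])
    return log_num - log_den + math.log(len(prior_samples))

rng = random.Random(0)
prior_samples = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # z_i ~ p(z), L = 100

state, skill = 0.0, 0.5
# The next state lands where the model predicts for this skill.
predictable = intrinsic_reward(state + skill, state, skill, prior_samples)
# The next state looks like the outcome of a different skill.
mismatched = intrinsic_reward(state - 0.5, state, skill, prior_samples)

print(predictable, mismatched)
```

A transition that is well explained by the active skill, and poorly explained by the other sampled skills, receives a high value; a transition that other skills explain better receives a low one.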
Note that this procedure assumes $p(z \mid s) = p(z)$ when approximating $p(s' \mid s)$. While every skill is expected to induce a different state-distribution in principle, this is not a bad assumption to make, as we often expect skills to be almost state-independent (consider locomotion primitives, which can essentially be activated from the state-distribution of any other locomotion primitive). The impact of this assumption can be further attenuated if skills are randomly resampled from the prior $p(z)$ within an episode of interaction with the environment. Irrespective, we can avoid making this assumption if we use the variational lower bounds from Agakov [2004], which is the second way to optimize $I(s'; z \mid s)$. We use the following inequality, used in Hausman et al. [2018]:
$\mathscr{H}(x) \ge \int p(x, z)\log \frac{q(z \mid x)}{p(x, z)}\,dx\,dz$  (28) 
where $q$ is a variational approximation to the posterior $p(z \mid x)$.
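A discrete analogue of Eq. (28) makes the role of $q$ concrete. In the sketch below the joint table $p(x, z)$ is a hypothetical example; the bound is evaluated once with the true posterior, where it is tight, and once with a uniform $q$, where it is loose.

```python
import math

# Hypothetical joint p(x, z): rows index x, columns index z (illustrative values).
joint = [
    [0.20, 0.10],
    [0.05, 0.25],
    [0.30, 0.10],
]

def marginal_entropy(joint):
    """H(x) in nats, with p(x) obtained by summing each row over z."""
    return -sum(sum(row) * math.log(sum(row)) for row in joint)

def entropy_lower_bound(joint, q):
    """Right-hand side of Eq. (28) with sums in place of integrals; q[x][z] = q(z|x)."""
    return sum(
        p * math.log(q[x][z] / p)
        for x, row in enumerate(joint)
        for z, p in enumerate(row)
    )

true_posterior = [[p / sum(row) for p in row] for row in joint]  # p(z|x)
uniform_q = [[0.5, 0.5] for _ in joint]                          # uninformed q(z|x)

h_x = marginal_entropy(joint)
bound_tight = entropy_lower_bound(joint, true_posterior)
bound_loose = entropy_lower_bound(joint, uniform_q)

print(h_x, bound_tight, bound_loose)
```

The slack in the bound is the expected KL divergence between $p(z \mid x)$ and $q(z \mid x)$, which is why a better posterior approximation gives a tighter entropy estimate.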
$I(s'; z \mid s) = -\mathscr{H}(s' \mid s, z) + \mathscr{H}(s' \mid s)$  (29)  
$\quad \ge \mathbb{E}_{s}\mathbb{E}_{p(s', z \mid s)}\left[\log q_{\varphi}(s' \mid s, z)\right] + \mathbb{E}_{s}\mathbb{E}_{p(s', z \mid s)}\left[\log q_{\alpha}(z \mid s', s)\right] + \mathscr{H}(s', z \mid s)$  (30)  
$\quad = \mathbb{E}_{s}\mathbb{E}_{p(s', z \mid s)}\left[\log q_{\varphi}(s' \mid s, z) + \log q_{\alpha}(z \mid s', s)\right] + \mathscr{H}(s', z \mid s)$  (31) 
where we have used the inequality for $\mathscr{H}({s}^{\prime}s)$ to introduce the variational posterior for skill inference ${q}_{\alpha}(z\mid {s}^{\prime},s)$ besides the conventional variational lower bound to introduce $q({s}^{\prime}\mid s,z)$. Further decomposing the leftover entropy:
$\mathscr{H}(s', z \mid s) = \mathscr{H}(z \mid s) + \mathscr{H}(s' \mid s, z)$ 
Reusing the variational lower bound for marginal entropy from Agakov [2004], we get:
$\mathscr{H}(s' \mid s, z) \ge \mathbb{E}_{s, z}\left[\int p(s', a \mid s, z)\log \frac{q(a \mid s', s, z)}{p(s', a \mid s, z)}\,ds'\,da\right]$  (32)  
$\quad = -\log c + \mathscr{H}(s', a \mid s, z)$  (33)  
$\quad = -\log c + \mathscr{H}(s' \mid s, a, z) + \mathscr{H}(a \mid s, z)$  (34) 
Since the choice of posterior is up to us, we can choose $q(a \mid s', s, z) = 1/c$ to induce a uniform distribution over the bounded action space. For $\mathscr{H}(s' \mid s, a, z)$, notice that the underlying dynamics $p(s' \mid s, a)$ are independent of $z$, but the actions do depend upon $z$. Therefore, this corresponds to entropy-regularized RL when the dynamics of the system are deterministic. Even for stochastic dynamics, the analogy might be a good approximation, assuming the underlying dynamics are not very entropic. The final objective (making this low-entropy dynamics assumption) can be written as:
$I(s'; z \mid s) \ge \mathbb{E}_{s}\mathbb{E}_{p(s', z \mid s)}\left[\log q_{\varphi}(s' \mid s, z) + \log q_{\alpha}(z \mid s', s) - \log p(z \mid s)\right] + \mathscr{H}(a \mid s, z)$  (35) 
We defer experimentation with this objective to future work.
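For reference, the substitution chain from Eq. (31) to Eq. (35) can be written out in one place. This is a sketch that only restates the steps above, using $\mathscr{H}(z \mid s) = -\mathbb{E}[\log p(z \mid s)]$ and abbreviating the expectation $\mathbb{E}_{s}\mathbb{E}_{p(s', z \mid s)}$ as $\mathbb{E}$:

```latex
\begin{align*}
I(s'; z \mid s)
  &\ge \mathbb{E}\left[\log q_{\varphi}(s' \mid s, z) + \log q_{\alpha}(z \mid s', s)\right]
     + \mathscr{H}(z \mid s) + \mathscr{H}(s' \mid s, z) \\
  &\ge \mathbb{E}\left[\log q_{\varphi}(s' \mid s, z) + \log q_{\alpha}(z \mid s', s) - \log p(z \mid s)\right]
     - \log c + \mathscr{H}(s' \mid s, a, z) + \mathscr{H}(a \mid s, z)
\end{align*}
```

Eq. (35) then follows by discarding the constant $-\log c$ and the dynamics entropy $\mathscr{H}(s' \mid s, a, z)$, the latter under the low-entropy dynamics assumption stated above.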
Appendix D Interpolation in Continuous Latent Space
[Figure: three panels (images/image18.png, images/image19.png, images/image26.png) illustrating interpolation in the continuous latent space.]
Appendix E Model Prediction
From Figure 14, we observe that skill-dynamics can provide robust state-predictions over long planning horizons. When learning skill-dynamics with the x-y prior, we observe that the prediction error rises more slowly with the horizon than the norm of the actual position. This provides strong evidence of cooperation between the primitives and the skill-dynamics learned using DADS with the x-y prior. As the error growth for skill-dynamics learned on the full observation space is sub-exponential, a similar argument can be made for DADS without the x-y prior as well (albeit to a weaker extent).