Learning World Graphs to Accelerate
Hierarchical Reinforcement Learning
Abstract
In many real-world scenarios, an autonomous agent often encounters various tasks within a single complex environment. We propose to build a graph abstraction over the environment structure to accelerate the learning of these tasks. Here, nodes are important points of interest (pivotal states) and edges represent feasible traversals between them. Our approach has two stages. First, we jointly train a latent pivotal state model and a curiosity-driven goal-conditioned policy in a task-agnostic manner. Second, provided with the information from the world graph, a high-level Manager quickly finds solutions to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker. The Worker can then also leverage the graph to easily traverse to the pivotal states of interest, even across long distances, and explore non-locally. We perform a thorough ablation study to evaluate our approach on a suite of challenging maze tasks, demonstrating significant advantages from the proposed framework over baselines that lack world graph knowledge in terms of performance and efficiency.
Wenling Shang${}^{\dagger}$ Alex Trott${}^{\dagger}$ Stephan Zheng${}^{\dagger}$ Caiming Xiong Richard Socher
Salesforce Research
$\dagger$: indicates equal contributions. Wenling Shang is also with the University of Amsterdam; corresponding at [email protected]. See project page (link) for selected video demos and the Appendix.
Preprint. Under review.
1 Introduction
Many real-world scenarios require an autonomous agent to play different roles within a single complex environment. For example, Mars rovers carry out scientific objectives ranging from searching for rocks to calibrating orbiting instruments [71]. Intuitively, a good understanding of the high-level structure of its operational environment would help an agent accomplish its downstream tasks. In reality, however, both acquiring such world knowledge and effectively applying it to solve tasks are often challenging. To address these challenges, we propose a generic two-stage framework that first learns high-level world structure in the form of a simple directed weighted graph [11] and then integrates it into a hierarchical policy model.
In the initial stage, we alternate between exploring the world and updating a descriptor of the world in a graph format [11], referred to as the world graph (Figure 1), in an unsupervised fashion. The nodes, termed pivotal states, are the most critical states in recovering action trajectories [17, 47, 32]. In particular, given a set of trajectories, we optimize a fully differentiable recurrent variational autoencoder [19, 34, 51] with binary latent variables [70]. Each binary latent variable is designated to a state, and the prior distribution learned conditioning on that state indicates whether it belongs to the set of pivotal states. Wide-ranging and meaningful training trajectories are therefore essential ingredients to the success of the latent model. Existing world descriptor learning frameworks often use random [37] or curiosity-driven trajectories [4]. Our exploring agent collects trajectories from both random walks and a simultaneously learned curiosity-driven goal-conditioned policy [32, 69]. During training, exploration is also initiated from the current set of pivotal states, similar to the “cells” in Go-Explore [25], except that ours are learned by the latent model instead of using heuristics. The edges of the graph, extrapolated from both the trajectories and the goal-conditioned policy, correspond to the actionable transitions between close-by pivotal states. Finally, the goal-conditioned policy can be used to further promote transfer learning on downstream tasks [86].
At first glance, the world graph seems suitable for model-based RL [59, 50], but our method emphasizes the connections among neighboring pivotal states rather than transitions over any arbitrary pair, which is usually considered a much harder problem [35]. Therefore, in the next stage, we propose a hierarchical reinforcement learning [54, 63] (HRL) approach to incorporate the world graph for solving specific downstream tasks. Concretely, within the paradigm of goal-conditioned HRL [21, 68, 90, 57], our approach innovates how the high-level Manager provides goals and how the low-level Worker navigates. Instead of sending out a single objective, the Manager first selects a pivotal state from the world graph and then specifies a final goal within a nearby neighborhood of the selected pivotal state. We refer to this sequential selection as the Wide-then-Narrow (WN) instruction. This construction allows us to utilize the information from the learned graph descriptor and to form passages between pivotal states through application of graph traversal techniques [9], thanks to which the Worker can now focus on local objectives. Lastly, as previously mentioned, the goal-conditioned policy derived from learning the world graph can serve as an initialization to the Manager and Worker, allowing fast skill transfer to new tasks as demonstrated by our experiments.
In summary, our main contributions are:

• A complete two-stage framework for 1) unsupervised world graph discovery and 2) accelerated HRL by integrating the graph.
• The first stage proposes an unsupervised module to learn world graphs, including a novel recurrent differentiable binary latent model and a curiosity-driven goal-conditioned policy.
• The second stage proposes a general HRL scheme with novel components such as the Wide-then-Narrow instruction, navigation via world graph traversal and skill transfer from goal-conditioned policy models.
• Quantitative and qualitative empirical findings over a complex 2D maze domain show that our proposed framework 1) produces a graph descriptor representative of the world and 2) improves both sample efficiency and final performance in solving downstream tasks by a large margin over baselines that lack the descriptor.
2 Environment
For clarity of exposition and scientific control, we choose finite, fully observable yet complex 2D mazes [18] as our testbeds, i.e. for each state-action pair and its transition $({s}_{t},{a}_{t})\to {s}_{t+1}$ with ${s}_{t},{s}_{t+1}\in \mathcal{S}$ and ${a}_{t}\in \mathcal{A}$, the sets $\mathcal{S}$ and $\mathcal{A}$ are finite. More involved environments can introduce interfering factors that overshadow the effects of the proposed method, e.g. the need for a well-calibrated latent goal space [44, 24, 67]. Section 7 briefly speculates on extensions of our framework to other environments as future directions. We employ 3 mazes of small, medium and large sizes with varying compositions (see the Appendix for visualization). Despite being finite and fully observable, these mazes still pose significant challenges, especially when the environment becomes large, involves stochasticity, provides only sparse reward or requires more complicated logic. The maze states received by the agent are in the form of bird's-eye view matrix representations. More details on preprocessing are available in the Appendix.
3 World Graph Discovery
We envision a simple directed weighted graph [11] ${\mathcal{G}}_{w}$ to capture the high-level structure of the world. Its nodes are a set of points of interest, termed pivotal states (${s}_{p}\in {\mathcal{V}}_{p}$), and edges represent transitions between nodes. Drawing intuition from unsupervised sequence segmentation [17, 47] and imitation learning [1, 46], we define ${\mathcal{V}}_{p}$ as the most critical states in recovering action sequences generated by some agent, indicating that these states lead to the most information gain [4]. In other words, given a trajectory $\tau ={\{({s}_{t},{a}_{t})\}}_{0}^{T}$, we learn to identify the most critical subset of states $\{{s}_{t}\mid {s}_{t}\in {\mathcal{V}}_{p}\}$ for accurately inferring the action sequences taken in the full trajectory $\tau $.
Supposing the state-action trajectories are available, we formulate a recurrent variational inference model (see Section 3.1) [12, 19, 34, 51], treating the action sequences as evidence and, for each state ${s}_{t}$ in the sequence, inferring a binary latent variable ${z}_{t}$ that controls whether to keep the state for action reconstruction. We learn a prior over each latent ${z}_{t}$ conditioned on its designated state ${s}_{t}$, as opposed to using a fixed prior or conditioning on the surrounding trajectory, and use the prior mean as the criterion for including ${s}_{t}$ in ${\mathcal{V}}_{p}$.
Meaningful ${\mathcal{V}}_{p}$ are learned from meaningful trajectories; hence, we develop a procedure to alternately update the latent model and improve trajectory collections. When collecting training trajectories, we place the agent at a state from the current iteration’s set of ${\mathcal{V}}_{p}$; this is possible since the agent can straightforwardly document and reuse the paths from its initial position to states in ${\mathcal{V}}_{p}$. This naturally allows the exploration starting points to expand as the agent discovers more of its environment. Random walk trajectories tend to be noisy and thus perhaps irrelevant to real tasks. We instead take inspiration from prior work on actionable representations [32] and learn a goal-conditioned policy ${\pi}_{g}$ for navigating between close-by states, reusing observed trajectories for unsupervised learning (Section 3.2). To ensure broad state coverage and diverse trajectories, we add a curiosity reward from the unsupervised action reconstruction error to learn ${\pi}_{g}$. The latent model is then updated with new trajectories. This cycle repeats until the action reconstruction accuracy plateaus. To form the edges of ${\mathcal{G}}_{w}$, we again use both random trajectories and ${\pi}_{g}$ (Section 3.3). Lastly, the implicit knowledge of the world embedded in ${\pi}_{g}$ can be further transferred to downstream tasks through weight initialization, which will be discussed later on (Section 4.4).
The pseudocode summarization of world graph discovery, implementation details, a visualization of how ${\mathcal{V}}_{p}$ progresses over training and the final ${\mathcal{V}}_{p}$ from different rollout policies are provided in the Appendix. The following sections concretely describe each component of our proposed process.
3.1 Recurrent Variational Model with Differentiable Binary Latent Variables
We propose a recurrent variational model with differentiable binary latent variables to discover ${\mathcal{V}}_{p}$ (Figure 2). Given a trajectory $\tau ={\{({s}_{t},{a}_{t})\}}_{0}^{T}$, we treat the action sequence ${\{{a}_{t}\}}_{0}^{T-1}$ as evidence in order to infer a sequence of binary latent variables ${z}_{t}$’s. The evidence lower bound is
$$\mathrm{ELBO}={\mathbb{E}}_{{q}_{\varphi}(\mathbf{Z}\mid \mathbf{A},\mathbf{S})}\left[\mathrm{log}\,{p}_{\theta}(\mathbf{A}\mid \mathbf{S},\mathbf{Z})\right]+{D}_{\mathrm{KL}}\left({q}_{\varphi}(\mathbf{Z}\mid \mathbf{A},\mathbf{S})\,\|\,{p}_{\psi}(\mathbf{Z}\mid \mathbf{S})\right).$$  (1)
The objective ${\mathbb{E}}_{{q}_{\varphi}(\mathbf{Z}\mid \mathbf{A},\mathbf{S})}\left[\mathrm{log}\,{p}_{\theta}(\mathbf{A}\mid \mathbf{S},\mathbf{Z})\right]$ is to reconstruct the action sequence given only the states ${s}_{t}$ where ${z}_{t}=1$, with the boundary states always given (${z}_{0}={z}_{T}=1$). To ensure differentiability, we opt to use a continuous relaxation of discrete binary latent variables by learning a Beta distribution as the prior for each ${z}_{t}$ [82]. Moreover, we learn the prior for each ${z}_{t}$ conditioned on its associated state ${s}_{t}$ (Figure 2). The prior mean for each ${z}_{t}$ signifies on average how necessary ${s}_{t}$ is for action reconstruction. Also, the KL-divergence term in Equation 1 between the approximate posterior and the learned prior encourages similar trajectories to pick the same states for action reconstruction. We define ${\mathcal{V}}_{p}$ as the top 20$\%$ of states ranked by the learned prior means.
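As a hedged illustration of this selection rule, the sketch below ranks states by the mean of a Beta prior, $\alpha /(\alpha +\beta )$, and keeps the top 20%. The dictionary `example_priors`, the function names and the rounding of the cut-off are assumptions made for illustration; in the paper the prior parameters are produced by the learned prior network.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def select_pivotal(prior_params, fraction=0.2):
    """Return the top `fraction` of states ranked by prior mean.

    prior_params maps a state identifier to its (alpha, beta) pair.
    Rounding the cut-off to the nearest integer is an assumption.
    """
    ranked = sorted(
        prior_params,
        key=lambda s: beta_mean(*prior_params[s]),
        reverse=True,
    )
    k = max(1, round(fraction * len(ranked)))
    return set(ranked[:k])

# Hypothetical priors for ten states; a larger alpha gives a larger mean.
example_priors = {f"s{i}": (float(i), 1.0) for i in range(1, 11)}
pivotal = select_pivotal(example_priors)
```

With these toy priors, the two states with the largest prior means are selected.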
The approximate posteriors follow the Hard Kumaraswamy distribution [8], $\mathrm{HardKuma}({\tilde{\alpha}}_{t},{\tilde{\beta}}_{t})$, which resembles the Beta distribution but is outside the exponential family. This choice allows us to sample 0’s and 1’s without sacrificing differentiability, accomplished via the stretch-and-rectify procedure [8, 60]. The simple CDF of Kuma also makes the reparameterization trick easily applicable [51, 79, 62]. Lastly, the KL-divergence between the Kuma and Beta distributions can be approximated in closed form [70]. We fix ${\tilde{\beta}}_{t}=1$ to ease optimization since the Kuma and Beta distributions coincide when ${\alpha}_{i}={\tilde{\alpha}}_{i},{\beta}_{i}={\tilde{\beta}}_{i}=1$.
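The stretch-and-rectify procedure can be sketched as follows: draw a Kumaraswamy sample through its inverse CDF, stretch it to an interval $(l,r)$ that strictly contains $[0,1]$, then clip back to $[0,1]$ so that exact 0's and 1's occur with nonzero probability. The support bounds `l = -0.1` and `r = 1.1` and all function names are illustrative assumptions, not values taken from this paper.

```python
import random

def sample_kuma(a, b, u):
    """Inverse-CDF sample of Kumaraswamy(a, b) from uniform noise u.

    The CDF is F(x) = 1 - (1 - x**a)**b, hence
    x = (1 - (1 - u)**(1/b))**(1/a).
    """
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def sample_hard_kuma(a, b, u, l=-0.1, r=1.1):
    """Stretch a Kuma sample to (l, r), then rectify (clip) to [0, 1]."""
    k = sample_kuma(a, b, u)
    stretched = l + (r - l) * k
    return min(1.0, max(0.0, stretched))

rng = random.Random(0)
samples = [sample_hard_kuma(a=0.5, b=1.0, u=rng.random()) for _ in range(1000)]
n_zero = sum(s == 0.0 for s in samples)  # samples rectified to exactly 0
n_one = sum(s == 1.0 for s in samples)   # samples rectified to exactly 1
```

Because the sample is a deterministic function of the noise `u`, gradients with respect to the shape parameters can flow through it in an automatic-differentiation framework; the plain-Python version here only illustrates the sampling path.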
There is not yet any constraint to prevent the model from selecting all states to reconstruct ${\{{a}_{t}\}}_{0}^{T-1}$. To introduce a selection bottleneck, we impose a regularization on the expected ${L}_{0}$ norm of $\mathbf{Z}=({z}_{1}\cdots {z}_{T-1})$ to promote sparsity at a targeted value ${\mu}_{0}$ [60, 8]. In other words, this objective constrains the number of activated ${z}_{t}=1$ to be ${\mu}_{0}$ given a sequence of length $T$. Another similarly constructed transition regularization encourages isolated activation of ${z}_{t}$, meaning the number of transitions between $0$ and $1$ among the ${z}_{t}$’s should roughly be $2{\mu}_{0}$. Note that both expectations in Equation 2 have closed forms for HardKuma.
$${\mathcal{L}}_{0}={\left\| {\mathbb{E}}_{{q}_{\varphi}(\mathbf{Z}\mid \mathbf{S},\mathbf{A})}\left[{\| \mathbf{Z}\|}_{0}\right]-{\mu}_{0}\right\|}^{2},\qquad {\mathcal{L}}_{T}={\left\| {\mathbb{E}}_{{q}_{\varphi}(\mathbf{Z}\mid \mathbf{S},\mathbf{A})}{\textstyle\sum}_{t=0}^{T}{\mathbb{1}}_{{z}_{t}\ne {z}_{t+1}}-2{\mu}_{0}\right\|}^{2}$$  (2)
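A minimal sketch of how the two regularizers in Equation 2 could be evaluated in closed form. It assumes HardKuma variables stretched to a support $(l,r)$, for which the probability of a nonzero sample is $1-{F}_{\mathrm{Kuma}}(-l/(r-l))$, counts a transition whenever exactly one of two neighboring variables is nonzero, and treats time steps as independent. The bounds, the independence assumption and the function names are illustrative and not taken from this paper.

```python
def prob_nonzero(a, b, l=-0.1, r=1.1):
    """P(z != 0) for a HardKuma(a, b) variable stretched to (l, r)."""
    x0 = (0.0 - l) / (r - l)          # Kuma value that maps to 0 after stretching
    return (1.0 - x0 ** a) ** b       # 1 - F(x0), with F(x) = 1 - (1 - x**a)**b

def sparsity_losses(alphas, betas, mu0):
    """Squared deviations of expected L0 norm and transition count from targets."""
    p = [prob_nonzero(a, b) for a, b in zip(alphas, betas)]
    expected_l0 = sum(p)
    # Expected number of neighbors where exactly one variable is nonzero,
    # assuming independence between time steps.
    expected_transitions = sum(
        p[t] * (1.0 - p[t + 1]) + (1.0 - p[t]) * p[t + 1]
        for t in range(len(p) - 1)
    )
    loss_0 = (expected_l0 - mu0) ** 2
    loss_t = (expected_transitions - 2.0 * mu0) ** 2
    return loss_0, loss_t, expected_l0, expected_transitions

# Three time steps with a = b = 1, so each p_t = 1 - 0.1 / 1.2.
loss_0, loss_t, exp_l0, exp_tr = sparsity_losses([1.0] * 3, [1.0] * 3, mu0=2.75)
```

In this toy call the target `mu0` equals the expected number of active variables, so the first loss vanishes while the transition loss does not.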
Lagrangian Relaxation.
The overall optimization objective consists of action sequence reconstruction, KL-divergence, ${\mathcal{L}}_{0}$ and ${\mathcal{L}}_{T}$ (Equation 3). We tune the objective weights ${\lambda}_{i}$ using Lagrangian relaxation [43, 8, 10], treating the ${\lambda}_{i}$’s as learnable parameters and performing alternating optimization between the ${\lambda}_{i}$’s and the model parameters. We observe that, as long as their initialization is within a reasonable range, the ${\lambda}_{i}$’s converge to a local optimum autonomously:
$$\underset{\{{\lambda}_{1,2,3}\}}{\mathrm{max}}\;\underset{\{\theta ,\varphi ,\psi \}}{\mathrm{min}}\;{\mathbb{E}}_{{q}_{\varphi}(\mathbf{Z}\mid \mathbf{A},\mathbf{S})}\left[\mathrm{log}\,{p}_{\theta}(\mathbf{A}\mid \mathbf{S},\mathbf{Z})\right]+{\lambda}_{1}{D}_{\mathrm{KL}}\left({q}_{\varphi}(\mathbf{Z}\mid \mathbf{A},\mathbf{S})\,\|\,{p}_{\psi}(\mathbf{Z}\mid \mathbf{S})\right)+{\lambda}_{2}{\mathcal{L}}_{0}+{\lambda}_{3}{\mathcal{L}}_{T}.$$  (3)
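The alternating max-min scheme can be illustrated on a toy problem that is unrelated to the latent model itself: minimize ${x}^{2}$ subject to $x\ge 1$, with Lagrangian ${x}^{2}+\lambda (1-x)$. The sketch alternates one gradient-descent step on the primal variable with one projected gradient-ascent step on the multiplier. The problem, step size and names are assumptions chosen only to show the update pattern.

```python
def alternating_saddle(steps=2000, lr=0.05):
    """Alternate descent on x and ascent on lam for L = x**2 + lam * (1 - x)."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: dL/dx = 2x - lam.
        x -= lr * (2.0 * x - lam)
        # Dual step: dL/dlam = 1 - x, projected onto lam >= 0.
        lam = max(0.0, lam + lr * (1.0 - x))
    return x, lam

x_star, lam_star = alternating_saddle()
```

The iterates approach the constrained optimum $x=1$ with multiplier $\lambda =2$; the multiplier grows while the constraint is violated and settles once it is met, which is the behavior the learnable ${\lambda}_{i}$’s are meant to provide.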
Our finalized latent model allows efficient and stable minibatch training. Alternative designs, such as Poisson prior [52] for latent space and Transformer [89] for sequence modeling, are also possibilities for future investigation. More details related to the latent model can be found in the Appendix.
3.2 Curiosity-Driven Goal-Conditioned Agent
A goal-conditioned policy, $\pi ({a}_{t}\mid {s}_{t},g)$, or ${\pi}_{g}$, is trained to reach a goal state $g\in \mathcal{S}$ given current state ${s}_{t}$ [32]. For large state spaces, training a goal-conditioned policy to navigate between any two states is non-trivial. However, our use cases, including trajectory generation for unsupervised learning and navigation between nearby pivotal states in downstream tasks, only require ${\pi}_{g}$ to reach goals over a short range. We train such an A2C-based goal-conditioned policy by sampling goals using the end points of random walks of reasonable length from a given starting state. Inspired by the success of intrinsic motivation methods, in particular curiosity [14, 2, 75, 4], we leverage the readily available action reconstruction errors from the generative decoder as intrinsic reward signals to boost exploration when training ${\pi}_{g}$. The pseudocode describing this method is in the Appendix.
3.3 Edge Connections
The last crucial step towards completing the world graph is building the edge connections. After finalizing ${\mathcal{V}}_{p}$, we perform random walks from ${s}_{p}\in {\mathcal{V}}_{p}$ to discover the underlying adjacency matrix [11] connecting individual ${s}_{p}$’s. More precisely, we claim a directed edge ${s}_{p}\to {s}_{q}$ if there exists a random walk trajectory from ${s}_{p}$ to ${s}_{q}$ that does not intersect a third pivotal state. We then collect the shortest such actionable paths as the edge paths. Each path is further refined by ${\pi}_{g}$ if feasible. The action sequence length of the edge path between adjacent pivotal states defines the weight of the edge. Traversals between pivotal states are planned based on the weight information using dynamic programming [85, 27]. For deterministic environments, the agent can simply follow the action sequence from the edge to transit between pivotal states. When the environment is stochastic, the agent traverses following the goal-conditioned policy (see Section 4.3). The planning in this case can potentially be improved by probabilistically or functionally encoding the edge weights [94, 80], which is left for future work.
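The edge construction and route planning described above can be sketched as follows. Walks are assumed to be lists of `(state, action)` pairs, an edge keeps the shortest action sequence seen between two consecutive pivotal states, and the route is planned with Dijkstra's algorithm, which is swapped in here as a simple stand-in for the dynamic programming cited in the text. The refinement by ${\pi}_{g}$ is omitted, and all names and the toy data are hypothetical.

```python
import heapq

def build_edges(walks, pivotal):
    """Map (src, dst) pivotal pairs to the shortest observed action sequence."""
    edges = {}
    for walk in walks:
        start = None  # index of the most recent pivotal state in this walk
        for i, (state, _action) in enumerate(walk):
            if state not in pivotal:
                continue
            if start is not None and walk[start][0] != state:
                key = (walk[start][0], state)
                actions = [a for _, a in walk[start:i]]
                if key not in edges or len(actions) < len(edges[key]):
                    edges[key] = actions
            start = i  # no third pivotal state can lie inside a segment
    return edges

def shortest_route(edges, source, target):
    """Dijkstra over edge weights given by action-sequence length."""
    graph = {}
    for (u, v), actions in edges.items():
        graph.setdefault(u, []).append((v, len(actions)))
    dist, prev, heap = {source: 0}, {}, [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if target not in dist:
        return None, float("inf")
    route = [target]
    while route[-1] != source:
        route.append(prev[route[-1]])
    return route[::-1], dist[target]

# Toy walks: A reaches C directly in 5 actions, or via B in 2 + 2 actions.
walks = [
    [("A", "right"), ("x", "right"), ("B", "down"), ("y", "down"), ("C", None)],
    [("A", "down"), ("p", "down"), ("q", "right"), ("r", "right"), ("s", "up"), ("C", None)],
]
edges = build_edges(walks, pivotal={"A", "B", "C"})
route, cost = shortest_route(edges, "A", "C")
```

In the toy example the planner prefers the two-hop route through `B` because its total action count is lower than that of the direct edge.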
4 Accelerated Hierarchical Reinforcement Learning
We now introduce a hierarchical reinforcement learning [54, 63] (HRL) framework that leverages the world graph ${\mathcal{G}}_{w}$ to accelerate learning downstream tasks. This framework has three core features:

• the Manager uses two-step “Wide-then-Narrow” goal descriptions (Section 4.2),
• the Worker traverses the learned world graph ${\mathcal{G}}_{w}$ at appropriate times (Section 4.3),
• the goal-conditioned policy ${\pi}_{g}$ learned in the graph discovery stage is used for weight initialization for the Worker and Manager (Section 4.4).
We show that our method learns to solve new tasks significantly faster and better compared to related baselines (Section 4.1). For implementation details, see the Appendix.
4.1 Preliminaries and Hierarchical Reinforcement Learning
Formally, we consider a Markov Decision Process, where at time $t$ an agent in a state ${s}_{t}$ executes an action ${a}_{t}$ via a policy $\pi ({a}_{t}\mid {s}_{t})$ and receives rewards ${r}_{t}$. The agent’s goal is to maximize its cumulative expected return $R={\mathbb{E}}_{({s}_{t},{a}_{t})\sim \pi ,p,{p}_{0}}[{r}_{t}]$, where $p({s}_{t+1}\mid {s}_{t},{a}_{t})$ and ${p}_{0}({s}_{0})$ are the transition and initial state distributions. To solve this problem, we consider a model-free, on-policy learning baseline, the advantage actor-critic (A2C) [93, 74, 66], and its hierarchical extension called Feudal Network (FN) [21, 90] (Figure 4). At the core, A2C models both a value function $V({s}_{t})$, by regressing over the estimated ${t}_{\mathrm{max}}$-step discounted returns ${\mathrm{\Sigma}}_{{t}^{\prime}=t}^{{t}_{\mathrm{max}}}{\gamma}^{{t}^{\prime}-t}{r}_{{t}^{\prime}}$ with discount rate $\gamma \in (0,1)$, and a policy $\pi $ guided by advantage-based policy gradient [83]. The policy entropy is regularized to encourage exploration. A2C’s hierarchical extension FN consists of a high-level controller (“Manager”), which learns to propose subgoals to the low-level controller (“Worker”), which learns to complete the subgoals. The Manager receives rewards from the environment and the Worker receives rewards from the Manager by reaching its subgoals. The high- and low-level policy models are distinct and operate at different temporal resolutions: the Manager only outputs a new subgoal if either the Worker completes its current one or the subgoal horizon $c$ is exceeded. In this work, we mainly consider finite, discrete and fully observable mazes. As such, FN can use any state as a subgoal and the Manager policy can emit a probability vector of dimension $|\mathcal{S}|$, although our framework supports more general subgoal definitions. More implementation details and pseudocode for our baselines are in the Appendix.
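As a small worked example of the discounted returns and advantages that A2C regresses on, the sketch below accumulates rewards backwards in time. The optional `bootstrap_value` (a value estimate for the state after the last reward) is a common A2C convention and an assumption here; setting it to zero recovers the plain discounted sum given in the text.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted return from each step: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def advantages(returns, values):
    """Advantage estimate: return minus the critic's value prediction."""
    return [g - v for g, v in zip(returns, values)]

returns = n_step_returns([1.0, 1.0, 1.0], bootstrap_value=0.0, gamma=0.5)
adv = advantages(returns, values=[1.0, 1.0, 1.0])
```

The critic is trained to regress onto `returns`, while `adv` weights the log-probability gradient of the chosen actions.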
4.2 World Graph Nodes for Wide-then-Narrow Manager Instructions
To connect the learned graph ${\mathcal{G}}_{w}$ to the HRL framework, the Manager needs the ability to designate any state $s$ as a subgoal while using the abstraction provided by ${\mathcal{G}}_{w}$. To that end, we structure the Manager’s output using a Wide-then-Narrow (WN) format:

1. Given a predefined set of candidate states $\mathcal{V}$, the Manager uses a “wide” policy ${\pi}^{\omega}({s}_{t})$ that outputs a “wide” subgoal ${g}_{w}\in \mathcal{V}$. This work proposes that the “wide” goals come from the learned pivotal states, $\mathcal{V}={\mathcal{V}}_{p}$.
2. Next, the Manager zooms its attention to an $N\times N$ local area ${s}_{w,t}={s}_{w}({g}_{w,t})$ around ${g}_{w}$. A “narrow-goal” policy ${\pi}^{n}$ then selects a final “narrow” goal ${g}_{n}\in {s}_{w,t}$, using both global ${s}_{t}$ and local ${s}_{w,t}$ information. Both goals are passed to the Worker, which is rewarded if it reaches ${g}_{w}$ or ${g}_{n}$ within the horizon $c$.
Using the WN goal format, the policy gradient for the Manager policy ${\pi}^{m}$ becomes:
${\mathbb{E}}_{({s}_{t},{a}_{t})\sim \pi ,p,{p}_{0}}\left[{A}_{m,t}\nabla \mathrm{log}\left({\pi}^{\omega}({g}_{w,t}\mid {s}_{t})\,{\pi}^{n}({g}_{n,t}\mid {s}_{t},{g}_{w,t},{s}_{w,t})\right)\right]+\nabla \left[\mathscr{H}\left({\pi}^{\omega}\right)+\mathscr{H}\left({\pi}^{n}(\cdot \mid {g}_{w,t})\right)\right],$
where ${A}_{m,t}$ is the Manager’s advantage at time $t$. Since the size of the action space scales linearly with $|\mathcal{S}|$, the exact entropy of ${\pi}^{m}$ can easily become intractable (see the Appendix). Thus, in practice we resort to an effective alternative, $\mathscr{H}\left({\pi}^{\omega}\right)+\mathscr{H}\left({\pi}^{n}(\cdot \mid {g}_{w,t})\right)$.
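The two-step goal selection and the factorized log-probability and entropy terms above can be sketched on a toy grid. States are assumed to be `(row, col)` tuples, $N$ is assumed odd, and `narrow_logits_fn` is a hypothetical stand-in for the narrow-goal policy network; none of these names come from the paper.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def neighborhood(center, n, valid_states):
    """Valid states inside the n x n window centered on `center` (n odd)."""
    half = n // 2
    r0, c0 = center
    return sorted(
        (r, c)
        for r in range(r0 - half, r0 + half + 1)
        for c in range(c0 - half, c0 + half + 1)
        if (r, c) in valid_states
    )

def sample_wn_goal(wide_logits, pivotal_states, narrow_logits_fn, valid_states, n, rng):
    """Sample (g_w, g_n) and return log pi^w + log pi^n and H(pi^w) + H(pi^n)."""
    wide_probs = softmax(wide_logits)
    i = rng.choices(range(len(pivotal_states)), weights=wide_probs)[0]
    g_w = pivotal_states[i]
    local = neighborhood(g_w, n, valid_states)
    narrow_probs = softmax(narrow_logits_fn(g_w, local))
    j = rng.choices(range(len(local)), weights=narrow_probs)[0]
    g_n = local[j]
    log_prob = math.log(wide_probs[i]) + math.log(narrow_probs[j])
    return g_w, g_n, log_prob, entropy(wide_probs) + entropy(narrow_probs)

rng = random.Random(0)
valid = {(r, c) for r in range(5) for c in range(5)}
pivotal = [(1, 1), (3, 3)]
uniform_narrow = lambda g_w, local: [0.0] * len(local)  # placeholder policy
g_w, g_n, log_prob, ent = sample_wn_goal([0.0, 0.0], pivotal, uniform_narrow, valid, 3, rng)
```

The entropy term only requires the narrow distribution for the sampled ${g}_{w}$, which is what keeps it cheap compared with the exact entropy of the joint goal distribution.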
4.3 Using World Graph Edges for Traversal
With pivotal states serving as wide goals, we can effectively take advantage of the edges in the world graph ${\mathcal{G}}_{w}$ through graph traversals:

1. When to Traverse: When a Worker is given a goal pair $({g}_{w},{g}_{n})$, it can traverse the world via ${\mathcal{G}}_{w}$ if it encounters a pivotal state ${g}_{w}^{\prime}$ that has a feasible connection to ${g}_{w}$ in ${\mathcal{G}}_{w}$.
2.
3. Execution: Once the route is planned, for deterministic environments, the agent simply follows the action sequence from the edge paths. For stochastic environments, we either prevent the agent from following a route that is newly blocked and expect the Manager to adapt accordingly (e.g. in DoorKey) or rely on ${\pi}_{g}$ to navigate between pivotal states (e.g. in MultiGoalStochastic).
During learning, ${\pi}_{g}$ can be simultaneously fine-tuned to adapt to task-specific environment stochasticity. When traversing under ${\pi}_{g}$, if the agent fails to reach the next target pivotal state within a certain time limit, it simply stops its current pursuit. The benefit of world graph traversal is that it allows the Manager to assign more task-relevant goals that can be far away from the agent’s position yet easily reachable by leveraging the connectivity knowledge of the world. In this way, we can speed up learning by focusing the low-level exploration on the relevant parts of the world only, i.e., around those highly task-relevant pivotal states ${g}_{w}$.
4.4 Knowledge Transfer through Goal-Conditioned Policy Initialization
Lastly, we transfer the implicit knowledge of the world acquired by ${\pi}_{g}$ during world graph discovery to the subsequent HRL training. Transferring and generalizing skills between RL tasks often leads to performance gains [86, 7], and goal-conditioned policies have been shown to capture the underlying structure of the environment well [32]. Additionally, optimizing a neural network system like HRL [20] is sensitive to weight initialization [64, 55], due to its complexity and lack of clear supervision. Therefore, taking inspiration from the prevailing pre-training procedures in computer vision [81, 23] and NLP [22, 78], we achieve implicit skill transfer and improved optimization by initializing the task-specific Worker and Manager with the weights from ${\pi}_{g}$. Our empirical results provide strong evidence that such initialization serves as an essential basis for solving challenging RL tasks later on, analogous to similar practices in the other domains.
| Task | MultiGoal Dense Reward | | | MultiGoal Sparse Reward | | | Stochastic MultiGoal | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Maze size | Small | Medium | Large | Small | Medium | Large | Small | Medium | Large |
| Baselines | | | | | | | | | |
| A2C | $2.04$ | Fail | Fail | $-0.10$ | Fail | Fail | $1.38$ | Fail | Fail |
| FN | Fail | Fail | Fail | $0.19$ | Fail | Fail | Fail | Fail | Fail |
| FN $+$ ${\pi}_{g}$ init | $2.93$ | Fail | Fail | Fail | Fail | Fail | $1.93$ | Fail | Fail |
| Wide-Narrow $+$ ${\pi}_{g}$ init | | | | | | | | | |
| ${\mathcal{V}}_{\mathrm{all}}$ | $4.73$ | $4.71$ | Fail | $0.32$ | Fail | Fail | $2.59$ | $1.62$ | Fail |
| ${\mathcal{V}}_{\mathrm{rand}}$ | $3.67$ | $4.72$ | Fail | $0.36$ | Fail | Fail | $2.82$ | Fail | Fail |
| ${\mathcal{V}}_{p}$ | $\mathbf{5.25}$ | $\mathbf{5.15}$ | Fail | $0.39$ | Fail | Fail | $\mathbf{3.06}$ | $\mathbf{2.99}$ | Fail |
| Wide-Narrow $+$ ${\mathrm{G}}_{w}$ traversal | | | | | | | | | |
| ${\mathcal{V}}_{\mathrm{rand}}$ | $3.85$ | $2.59$ | $1.65$ | $0.17$ | $0.19$ | $0.20$ | Fail | $1.67$ | Fail |
| ${\mathcal{V}}_{p}$ | $3.92$ | $2.56$ | $2.18$ | $0.24$ | $0.20$ | $0.16$ | Fail | $2.42$ | $1.05$ |
| Wide-Narrow $+$ ${\pi}_{g}$ init $+$ ${\mathrm{G}}_{w}$ traversal | | | | | | | | | |
| ${\mathcal{V}}_{\mathrm{rand}}$ | $4.16$ | $3.29$ | $2.30$ | $0.25$ | $0.24$ | $0.19$ | $2.72$ | $2.42$ | $1.93$ |
| ${\mathcal{V}}_{p}$ | $5.05$ | $3.00$ | $\mathbf{2.72}$ | $\mathbf{0.42}$ | $\mathbf{0.25}$ | $\mathbf{0.26}$ | $2.92$ | $2.62$ | $\mathbf{1.79}$ |

| DoorKey | $+$ ${\pi}_{g}$ init | | | $+$ ${\mathrm{G}}_{w}$ traversal | | $+$ ${\mathrm{G}}_{w}$ init $+$ traversal | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wide-Narrow | ${\mathcal{V}}_{\mathrm{all}}$ | ${\mathcal{V}}_{\mathrm{rand}}$ | ${\mathcal{V}}_{p}$ | ${\mathcal{V}}_{\mathrm{rand}}$ | ${\mathcal{V}}_{p}$ | ${\mathcal{V}}_{\mathrm{rand}}$ | ${\mathcal{V}}_{p}$ |
| Small | $94\pm 5$ | $97\pm 2$ | $\mathbf{99}\pm 0$ | Fail | $37\pm 15$ | $76\pm 14$ | $92\pm 2$ |
| Medium | $25\pm 15$ | $1\pm 1$ | $56\pm 2$ | Fail | Fail | $\mathbf{79}\pm 11$ | $76\pm 6$ |
| Large | Fail | Fail | Fail | Fail | Fail | $\mathbf{27}\pm 40$ | $\mathbf{26}\pm 19$ |
5 Experiments
We validate the effectiveness and assess the impact of each proposed component in a thorough ablation study on a set of 4 challenging maze tasks with different reward structures, levels of stochasticity and logic. Furthermore, we evaluate each task in three different mazes of increasing sizes (small, medium and large). Implementation details, snippets of the tasks and mazes are in the Appendix.
In all tasks, every action taken by the agent incurs a reward penalty of $-0.01$. The other specifics of each task are:

• In MultiGoal, the agent needs to collect 5 randomly spawned balls and exit from a designated exit point. Reaching each ball or the exit point gives reward $+1$.
• Its sparse version, MultiGoalSparse, only gives a single reward $r\le 1$ proportional to the number of balls collected upon exiting.
• Its stochastic version, MultiGoalStochastic, spawns a lava block at a random location at each time step; stepping on it immediately terminates the episode with a negative reward of $-0.5$.
• DoorKey is a much more difficult task that adds new actions (“pick” and “open”) and new objects to the environment (additional walls, doors, keys). The agent needs to pick up the key, open the door (reward $+1$) and reach the exit point on the other side (reward $+1$).
Control Experiments
We ablate each proposed component and compare against the non-hierarchical and hierarchical baselines, A2C and FN. The proposed components are always added on top of FN.

• First, we test initializing the Manager and Worker with the weights of ${\pi}_{g}$.
• Next, we evaluate WN with 3 different sets of $\mathcal{V}$ for the Manager to pick ${g}_{w}$ from: ${\mathcal{V}}_{\mathrm{all}}$ includes all valid states, ${\mathcal{V}}_{\mathrm{rand}}$ are uniformly sampled states, ${\mathcal{V}}_{p}$ are learned pivotal states. ${\mathcal{V}}_{p}$ and ${\mathcal{V}}_{\mathrm{rand}}$ are of the same size and their edge connections are obtained in the same way (Section 3.3).${}^{1}$
• Finally, we enable ${\mathcal{G}}_{w}$ traversal on top of WN. If traversal is done through ${\pi}_{g}$, then alongside HRL training we also refine ${\pi}_{g}$ (if given for initialization) or learn one from scratch (if not given for initialization).

${}^{1}$ The edge connection of ${\mathcal{V}}_{\mathrm{all}}$ is a trivial case excluded here, as every state is 1 step away from its adjacent states. Also, note that neither ${\pi}_{g}$ nor guaranteed state access is available to ${\mathcal{V}}_{\mathrm{rand}}$ when forming edge connections, but we grant all prerequisites for the fairest comparison possible.
We inherit most hyperparameters from the training of ${\pi}_{g}$ in the world graph discovery stage, as the Manager and the Worker both share a similar architecture with ${\pi}_{g}$. The hyperparameters of ${\pi}_{g}$ in turn follow those from [84]. For details, see the Appendix. We follow a rigorous evaluation protocol acknowledging the variability in outcomes in deep reinforcement learning [41]: each experiment is repeated with 3 seeds [92, 73], and 10 additional validation seeds are used to pick the best model, which is then tested on 100 testing seeds. The means of the selected testing results are in Table 1. We omit results of those experimental setups that consistently fail on all tasks, meaning that training is either not initiated or validation rewards are never above 0. See the Appendix for the full report, including standard deviations over the 3 seeds.
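One plausible reading of this protocol, sketched under assumptions: the candidate models are the runs from the training seeds, the best one is chosen by mean return over the validation seeds, and only that model is scored on the testing seeds. The `evaluate` callable and the toy scores are hypothetical placeholders, not part of the paper's code.

```python
from statistics import mean

def select_and_test(models, evaluate, val_seeds, test_seeds):
    """Pick the model with the best mean validation return, then test it."""
    best = max(models, key=lambda m: mean(evaluate(m, s) for s in val_seeds))
    return best, mean(evaluate(best, s) for s in test_seeds)

# Toy stand-in: each "model" has a fixed return regardless of the episode seed.
toy_scores = {"train_seed_0": 1.0, "train_seed_1": 3.0, "train_seed_2": 2.0}
evaluate = lambda model, seed: toy_scores[model]

best_model, test_mean = select_and_test(
    models=list(toy_scores),
    evaluate=evaluate,
    val_seeds=range(10),
    test_seeds=range(100, 200),
)
```

Keeping validation and testing seeds disjoint, as in the toy call, prevents the model selection step from leaking into the reported number.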
5.1 Empirical Analysis
Initialization with ${\pi}_{g}$
Table 1 and Figure 4 show that initialization with ${\pi}_{g}$ is crucial across all tasks, especially for the hierarchical models; e.g. a randomly initialized A2C outperforms a randomly initialized FN on small-maze MultiGoal. Models starting from scratch fail on almost all tasks within the maximal number of training iterations, unless coupled with ${\mathcal{G}}_{w}$ traversal, which is still inferior to using ${\pi}_{g}$ initialization.
WidethenNarrow
Comparing A2C, FN and ${\mathcal{V}}_{\mathrm{all}}$ suggests WN is a highly effective way to structure Manager subgoals. For example, in small MultiGoal, ${\mathcal{V}}_{\mathrm{all}}$ ($4.73\pm 0.5$) surpasses FN ($2.93\pm 0.74$) by a large margin. We posit that the Manager tends to select ${g}_{w}$ from a certain smaller subset of $\mathcal{V}$, simplifying the learning of transitions between ${g}_{w}$’s for the Worker. As a result, the Worker can focus on solving local objectives. The same reasoning conceivably explains why ${\mathcal{G}}_{w}$ traversal does not yield performance gains on small and medium MultiGoal. For instance, ${\mathcal{V}}_{p}$ on small MultiGoal scores $5.25\pm 0.13$, slightly higher than with traversal ($5.05\pm 0.13$). However, once mazes become large, the Worker struggles to master traversals on its own and thus starts to fail the tasks.
World Graph Traversal
In the case described above, the addition of world graph traversal plays an essential role, e.g. for large MultiGoal. As we conjectured in Section 4.3, this phenomenon can be explained by two effects of ${\mathcal{G}}_{w}$ traversal: a much expanded exploration range, and relieving the Worker of the responsibility to learn long-distance transitions. Moreover, Figure 4 confirms another conjecture from Section 4.3: ${\mathcal{G}}_{w}$ traversal speeds up convergence, more evidently with larger mazes. Lastly, in DoorKey, the agent needs to plan and execute a particular combination of actions. The large gap on medium DoorKey between using traversal and not using it, $75\pm 6$ vs $56\pm 2$, suggests ${\mathcal{G}}_{w}$ traversal indeed improves long-horizon planning.
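To make the idea of traversal over a weighted directed graph concrete, the following is a minimal sketch that plans the cheapest route between two pivotal states. It swaps in Dijkstra's algorithm as a standard shortest-path routine and is not necessarily the procedure used in this work. The function name `shortest_traversal`, the adjacency-dictionary layout, and the assumption of non-negative edge weights are all illustrative.

```python
import heapq
from typing import Dict, Hashable, List, Optional, Tuple

# Hypothetical layout: graph[u][v] is the non-negative cost of edge u -> v.
Graph = Dict[Hashable, Dict[Hashable, float]]


def shortest_traversal(
    graph: Graph, start: Hashable, goal: Hashable
) -> Optional[Tuple[float, List[Hashable]]]:
    """Return (cost, path) of the cheapest directed path, or None if unreachable."""
    best_cost = {start: 0.0}
    parent: Dict[Hashable, Hashable] = {}
    # The counter breaks ties so that nodes themselves are never compared.
    frontier = [(0.0, 0, start)]
    counter = 1
    while frontier:
        cost, _, node = heapq.heappop(frontier)
        if node == goal:
            # Walk parent pointers back to the start to recover the path.
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost, counter, nxt))
                counter += 1
    return None
```

Because edges are directed, a route from one pivotal state to another does not imply that the reverse route exists, which is why the function returns `None` for unreachable goals.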
Benefit of Learned Pivotal States
Comparing ${\mathcal{V}}_{p}$ to ${\mathcal{V}}_{\mathrm{rand}}$ reveals the quality of the pivotal states identified by the latent model. Overall, ${\mathcal{V}}_{p}$ performs either better than ${\mathcal{V}}_{\mathrm{rand}}$ (particularly in non-deterministic environments) or similarly to it, but with much less variance across seeds. A luckily chosen set of random states that happens to suit a task can deliver great results, but the opposite is equally possible. In addition, edge formation over ${\mathcal{V}}_{\mathrm{rand}}$ still depends on the by-products of learning the world graph; hence using the pivotal states together with their coupled ${\mathcal{G}}_{w}$ is preferable to using ${\mathcal{V}}_{\mathrm{rand}}$.
6 Related Works
Pivotal state discovery is related to unsupervised sequence segmentation [17, 47, 76, 13, 16] and option or subtask discovery in the context of RL [48, 6, 72, 30, 56, 53, 52]. Among them, both [52] and [76] employ sequential variational models to infer either task boundaries or key frames of demonstrations in an unsupervised manner, followed by applications to hierarchical RL or planning. Beyond technical differences, our module does not look for turning points in individual sequences; it aims to identify a set of landmark states that together represent the world well. Also, our training examples come from carefully orchestrated exploration rather than human demonstrations.
Understanding the world is central to planning, control and (model-based) RL. In robotics, a robot often needs to localize itself or navigate by interpreting a map of the world [61, 87, 3]. Our exploration strategy borrows a high-level insight from active localization in robotics, where robots are actively guided to investigate unfamiliar regions by humans [29, 58]. Another direction in this area is to learn a world model [4, 37, 36] that generates latent states [88, 38, 77]. If the dynamics of the world are also learned, then the model can be applied to planning [65, 39] or model-based RL [33, 50]. Although it involves generative modeling, our framework differentiates itself through the functionality of our binary latent variables, which indicate whether a state, regardless of its representation, is a pivotal state.
The policy-learning phase in our framework uses the paradigm of goal-conditioned HRL [57, 21, 68, 90]. In addition, the WN mechanism borrows ideas from attentive object understanding [31, 5, 95] in vision. World graph traversal is inspired by classic optimal planning in Markov Decision Processes with dynamic programming [9, 27, 91, 85]. Lastly, initialization with goal-conditioned policies to transfer knowledge is inspired by transfer learning [23, 86] and skill generalization in RL [7, 40, 32].
Concurrently, [26] also proposes to plan a sequence of subgoals leading to a final destination according to a graph abstraction of the world that is obtained via a goal-conditioned policy. However, under a different problem setup and different use-case assumptions, the nodes in [26] are not learned pivotal states but come directly from a replay buffer; similarly, their graph is task-specific whereas ours is designed to assist a variety of downstream tasks.
7 Conclusion and Future Work
We propose a general two-stage framework to learn a concise world abstraction as a simple directed graph with weighted edges that facilitates HRL for a diversity of downstream tasks. Our thorough ablation studies on several challenging maze tasks show a clear advantage from each proposed component of our framework.
The framework can be extended to other types of environments, such as partially observable and high-dimensional ones, through, e.g., probabilistic or differentiable planning [49, 94, 80], latent embedding of (belief) states [36] and goals [67]. Other directions are to adapt our framework to evolving or constantly changing environments, through, e.g., meta-learning [28], and/or to include off-policy methods to achieve better sample efficiency. Finally, the learned world graphs can potentially be applied beyond HRL, e.g., to multi-task RL [42], to structured exploration that uses pivotal states as checkpoints [25], or to multi-agent settings [15, 45].
References
 [1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 [2] J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732, 2017.
 [3] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. A fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, pages 1027–1037, 2008.
 [4] M. G. Azar, B. Piot, B. A. Pires, J.-B. Grill, F. Altche, and R. Munos. World discovery model. arXiv, 2019.
 [5] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
 [6] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [7] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065, 2017.
 [8] J. Bastings, W. Aziz, and I. Titov. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 2019 Conference of the Association for Computational Linguistics, Volume 1 (Long Papers). Association for Computational Linguistics, 2019.
 [9] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. 1995.
 [10] D. P. Bertsekas. Nonlinear Programming. 1999.
 [11] N. Biggs. Algebraic Graph Theory. 1993.
 [12] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 [13] D. M. Blei and P. J. Moreno. Topic segmentation with an aspect hidden markov model. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 343–348. ACM, 2001.
 [14] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
 [15] L. Buşoniu, R. Babuška, and B. De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in multi-agent systems and applications-1, pages 183–221. Springer, 2010.
 [16] W. Chan, Y. Zhang, Q. Le, and N. Jaitly. Latent sequence decompositions. arXiv preprint arXiv:1610.03035, 2016.
 [17] M. Chatzigiorgaki and A. N. Skodras. Real-time keyframe extraction towards video content identification. In 2009 16th International conference on digital signal processing, pages 1–6. IEEE, 2009.
 [18] M. Chevalier-Boisvert and L. Willems. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.
 [19] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
 [20] J. D. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.
 [21] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
 [22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018.
 [23] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
 [24] Z. Dwiel, M. Candadi, M. Phielipp, and A. Bansal. Hierarchical policy learning is sensitive to goal space design. arXiv preprint, (2), 2019.
 [25] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
 [26] B. Eysenbach, R. Salakhutdinov, and S. Levine. Search on the replay buffer: Bridging planning and reinforcement learning. arXiv preprint arXiv:1906.05253, 2019.
 [27] Z. Feng, R. Dearden, N. Meuleau, and R. Washington. Dynamic programming for structured continuous markov decision problems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 154–161. AUAI Press, 2004.
 [28] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135, 2017.
 [29] D. Fox, W. Burgard, and S. Thrun. Active markov localization for mobile robots. Robotics and Autonomous Systems, 25(3-4):195–207, 1998.
 [30] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
 [31] G. Fritz, C. Seifert, L. Paletta, and H. Bischof. Attentive object detection using an information theoretic saliency measure. In International workshop on attention and performance in computational vision, pages 29–41. Springer, 2004.
 [32] D. Ghosh, A. Gupta, and S. Levine. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018.
 [33] K. Gregor and F. Besse. Temporal difference variational autoencoder. arXiv preprint arXiv:1806.03107, 2018.
 [34] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015.
 [35] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
 [36] Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, T. Pohlen, and R. Munos. Neural predictive belief representations. arXiv preprint arXiv:1811.06407, 2018.
 [37] D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
 [38] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
 [39] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
 [40] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.
 [41] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [42] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt. Multitask deep reinforcement learning with popart. arXiv preprint arXiv:1809.04474, 2018.
 [43] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 [44] I. Higgins, N. Sonnerat, L. Matthey, A. Pal, C. P. Burgess, M. Bosnjak, M. Shanahan, M. Botvinick, D. Hassabis, and A. Lerchner. Scan: Learning hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389, 2017.
 [45] J. Hu, M. P. Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. Citeseer, 1998.
 [46] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.
 [47] D. Jayaraman, F. Ebert, A. A. Efros, and S. Levine. Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018.
 [48] Y. Jinnai, J. W. Park, D. Abel, and G. Konidaris. Discovering options for exploration by minimizing cover time. arXiv preprint arXiv:1903.00606, 2019.
 [49] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
 [50] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
 [51] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2013.
 [52] T. Kipf, Y. Li, H. Dai, V. Zambaldi, E. Grefenstette, P. Kohli, and P. Battaglia. Compositional imitation learning: Explaining and executing one task at a time. arXiv preprint arXiv:1812.01483, 2018.
 [53] O. Kroemer, C. Daniel, G. Neumann, H. Van Hoof, and J. Peters. Towards learning hierarchical skills for multi-phase manipulation tasks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1503–1510. IEEE, 2015.
 [54] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
 [55] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
 [56] A. Léon and L. Denoyer. Options discovery with budgeted reinforcement learning. arXiv preprint arXiv:1611.06824, 2016.
 [57] A. Levy, R. Platt, and K. Saenko. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017.
 [58] A. Q. Li, M. Xanthidis, J. M. O’Kane, and I. Rekleitis. Active localization with dynamic obstacles. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1902–1909. IEEE, 2016.
 [59] M. L. Littman. Algorithms for sequential decision making. 1996.
 [60] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through $L_0$ regularization. arXiv preprint arXiv:1712.01312, 2017.
 [61] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2015.
 [62] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 [63] B. Marthi and C. Guestrin. Concurrent hierarchical reinforcement learning. 2005.
 [64] D. Mishkin and J. Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
 [65] V. Mnih, J. Agapiou, S. Osindero, A. Graves, O. Vinyals, K. Kavukcuoglu, et al. Strategic attentive writer for learning macroactions. arXiv preprint arXiv:1606.04695, 2016.
 [66] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 [67] O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018.
 [68] O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018.
 [69] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018.
 [70] E. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016.
 [71] NASA. The scientific objectives of the mars exploration rover. 2015.
 [72] S. Niekum and S. Chitta. Incremental semantically grounded learning from demonstration. 2013.
 [73] G. Ostrovski, M. G. Bellemare, A. v. d. Oord, and R. Munos. Count-based exploration with neural density models. ICML, 2017.
 [74] Y. P. Pane, S. P. Nageshrao, and R. Babuška. Actor-critic reinforcement learning for tracking control in robotics. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 5819–5826. IEEE, 2016.
 [75] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
 [76] K. Pertsch, O. Rybkin, J. Yang, K. Derpanis, J. Lim, K. Daniilidis, and A. Jaegle. Keyin: Discovering subgoal structure with keyframe-based video prediction. arXiv, 2019.
 [77] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pages 5690–5701, 2017.
 [78] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
 [79] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 [80] S. M. Ross. Introduction to stochastic dynamic programming. Academic press, 2014.
 [81] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
 [82] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
 [83] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
 [84] W. Shang, H. van Hoof, and M. Welling. Stochastic activation actor-critic methods. In ECML-PKDD, 2019.
 [85] R. S. Sutton. Introduction to reinforcement learning, volume 135. 1998.
 [86] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
 [87] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21–71, 1998.
 [88] Y. Tian and Q. Gong. Latent forward model for real-time strategy game planning with incomplete information. In NIPS Symposium, 2017.
 [89] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 [90] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3540–3549. JMLR.org, 2017.
 [91] G. Weiss. Dynamic programming and markov processes, 1960.
 [92] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In NIPS, 2017.
 [93] Y. Wu and Y. Tian. Training agent for first-person shooter game with actor-critic curriculum learning. 2016.
 [94] A. Yamaguchi and C. G. Atkeson. Neural networks and differential dynamic programming for reinforcement learning problems. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5434–5441. IEEE, 2016.
 [95] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016.