Stochastic Latent Actor-Critic:
Deep Reinforcement Learning
with a Latent Variable Model
Abstract
Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these kinds of observation spaces present a number of challenges in practice, since the policy must now solve two problems: a representation learning problem, and a task learning problem. In this paper, we aim to explicitly learn representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC learns a compact latent representation space using a stochastic sequential latent variable model, and then learns a critic model within this latent space. By learning a critic within a compact state space, SLAC can learn much more efficiently than standard RL methods. The proposed model improves performance substantially over alternative representations as well, such as variational autoencoders. In fact, our experimental evaluation demonstrates that the sample efficiency of our resulting method is comparable to that of model-based RL methods that directly use a similar type of model for control. Furthermore, our method outperforms both model-free and model-based alternatives in terms of final performance and sample efficiency, on a range of difficult image-based control tasks. Our code and videos of our results are available at our website: https://alexleegk.github.io/slac/
Alex X. Lee Anusha Nagabandi Pieter Abbeel Sergey Levine University of California, Berkeley {alexlee_gk,nagaban2,pabbeel,svlevine}@cs.berkeley.edu
Preprint. Under review.
1 Introduction
Deep reinforcement learning (RL) algorithms can automatically learn to solve certain tasks from raw, low-level observations such as images. However, these kinds of observation spaces present a number of challenges in practice: on one hand, it is difficult to directly learn from these high-dimensional inputs, but on the other hand, it is also difficult to tease out a compact representation of the underlying task-relevant information from which to learn instead. For these reasons, deep RL directly from low-level observations such as images remains a challenging problem. Particularly in continuous domains governed by complex dynamics, such as robotic control (Tassa et al., 2018; Brockman et al., 2016), standard approaches still require separate sensor setups to monitor details of interest in the environment, such as the joint positions of a robot or specific pose information of objects of interest. To instead be able to learn directly from the more general and rich modality of vision would greatly advance the current state of our learning systems, so we aim to study precisely this. Standard model-free deep RL aims to use direct end-to-end training to explicitly unify these tasks of representation learning and task learning. However, solving both problems together is difficult, since an effective policy requires an effective representation, but in order for an effective representation to emerge, the policy or value function must provide meaningful gradient information using only the model-free supervision signal (i.e., the reward function). In practice, learning directly from images with standard RL algorithms can be slow, sensitive to hyperparameters, and inefficient. In contrast to end-to-end learning with RL, predictive learning can benefit from a rich and informative supervision signal before the agent has even made progress on the task or received any rewards.
This leads us to ask: can we explicitly learn a latent representation from raw low-level observations that makes deep RL easier, through learning a predictive latent variable model?
Predictive models are commonly used in model-based RL for the purpose of planning (Deisenroth and Rasmussen, 2011; Finn and Levine, 2017; Nagabandi et al., 2018; Chua et al., 2018; Zhang et al., 2019) or generating cheap synthetic experience for RL to reduce the required amount of interaction with the real environment (Sutton, 1991; Gu et al., 2016). However, in this work, we are primarily concerned with their potential to alleviate the representation learning challenge in RL. We devise a stochastic predictive model by modeling the high-dimensional observations as the consequence of a latent process, with a Gaussian prior and latent dynamics, as illustrated in Figure 1. A model with an entirely stochastic latent state has the appealing interpretation of being able to properly represent uncertainty about any of the state variables, given its past observations. We demonstrate in our work that fully stochastic state space models can in fact be learned effectively: with a well-designed stochastic network, such models outperform fully deterministic models, and contrary to the observations in prior work (Hafner et al., 2019; Buesing et al., 2018), are actually comparable to (if not better than) mixed deterministic/stochastic models. Finally, we note that this explicit representation learning, even on low-reward data, allows an agent with such a model to make progress on representation learning even before it makes progress on task learning.
Equipped with this model, we can then perform RL in the learned latent space of the predictive model. We posit, and confirm experimentally, that our latent variable model provides a useful representation for RL. Our model represents a partially observed Markov decision process (POMDP), and solving such a POMDP exactly would be computationally intractable (Astrom, 1965; Kaelbling et al., 1998; Igl et al., 2018). We instead propose a simple approximation that trains a Markovian critic on the (stochastic) latent state and trains an actor on a history of observations and actions. The resulting stochastic latent actor-critic (SLAC) algorithm loses some of the benefits of full POMDP solvers, but it is easy and stable to train. It also produces good results, in practice, on a range of challenging problems, making it an appealing alternative to more complex POMDP solution methods.
The main contributions of our SLAC algorithm are useful representations learned from our stochastic sequential latent variable model, as well as effective RL in this learned latent space. We show experimentally that our approach substantially improves on both model-free and model-based RL algorithms on a range of image-based continuous control benchmark tasks, attaining better final performance and learning more quickly than algorithms based on (a) end-to-end deep RL from images, (b) learning in a latent space produced by various alternative latent variable models, such as a variational autoencoder (VAE) (Kingma and Welling, 2014), and (c) model-based RL based on latent state-space models with mixed deterministic/stochastic variables (Hafner et al., 2019).
2 Related Work
Representation learning in RL. End-to-end deep RL can in principle learn representations directly as part of the RL process (Mnih et al., 2013). However, prior work has observed that RL has a “representation learning bottleneck”: a considerable portion of the learning period must be spent acquiring good representations of the observation space (Shelhamer et al., 2016). This motivates the use of a distinct representation learning procedure to acquire these representations before the agent has even learned to solve the task. The use of unsupervised learning to learn such representations has been explored in a number of prior works (Lange and Riedmiller, 2010; Finn et al., 2016). In contrast to this class of representation learning algorithms where temporal connections between data are not explicitly considered, we utilize a latent-space dynamics model, which we find in empirical comparisons to result in substantially better representations for RL. By modeling covariances between consecutive latent states, we make it feasible for our proposed stochastic latent actor-critic (SLAC) algorithm to perform Bellman backups directly in the latent space of the learned model. In contrast to prior work that also uses latent-space dynamical system models (Watter et al., 2015; Karl et al., 2017; Zhang et al., 2019; Hafner et al., 2019), our approach benefits from the good asymptotic performance of model-free RL, while at the same time leveraging the improved latent space representation for sample efficiency, despite not using any model-based rollouts for data augmentation.
Partial observability. Our work is also related to prior research on RL under partial observability. Prior work has studied end-to-end RL with recurrent models (Hausknecht and Stone, 2015; Zhu et al., 2018), as well as explicit modeling of the POMDP (Astrom, 1965; Kaelbling et al., 1998) to incorporate belief state into the policy. Prior work has also proposed to train a mixed deterministic/stochastic hidden state model jointly with a policy for the purpose of solving POMDPs (Igl et al., 2018). Our approach is closely related to this prior method, in that we also use model-free RL with a hidden state representation that is learned via prediction. However, our model learns a fully stochastic latent state, and we focus on tasks with high-dimensional image observations for complex underlying continuous control tasks, in contrast to Igl et al. (2018), which evaluates on Atari tasks and low-dimensional continuous control tasks that emphasize partial observability and knowledge-gathering actions. Our method learns a critic directly on the latent state. This makes our method simpler and more scalable, but at the cost of not being able to reason about the full belief state. In this sense, the strengths of our approach and that of Igl et al. (2018) are complementary, and combining their strengths would be an interesting direction for future work.
Model stochasticity. Previous works have typically observed that mixed deterministic/stochastic state space models are more effective (Hafner et al., 2019; Igl et al., 2018; Buesing et al., 2018) than fully stochastic ones. In these models, the state of the underlying MDP is modeled with the deterministic state of a recurrent network (e.g., LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014)), and optionally with some stochastic random variables. As mentioned earlier, a model with a latent state that is entirely stochastic has the appealing interpretation of learning a representation that can properly represent uncertainty about any of the state variables, given past observations. We demonstrate in our work that fully stochastic state space models can in fact be learned effectively and, with a well-designed stochastic network, such models outperform fully deterministic models and are comparable to (if not better than) mixed deterministic/stochastic models.
3 Reinforcement Learning and Modeling
This work addresses the problem of learning maximum entropy policies from high-dimensional observations in POMDPs, by simultaneously learning a latent representation of the underlying MDP state using variational inference and learning the policy in a maximum entropy RL framework. In this section, we describe maximum entropy RL (Ziebart, 2010; Haarnoja et al., 2018a; Levine, 2018) in fully observable MDPs, as well as variational methods for training latent state space models for POMDPs.
3.1 Maximum Entropy RL in Fully Observable MDPs
In a Markov decision process (MDP), an agent at time $t$ takes an action $\mathbf{a}_t\in\mathcal{A}$ from state $\mathbf{s}_t\in\mathcal{S}$ and reaches the next state $\mathbf{s}_{t+1}\in\mathcal{S}$ according to some stochastic transition dynamics $p(\mathbf{s}_{t+1}\mid\mathbf{s}_t,\mathbf{a}_t)$. The initial state $\mathbf{s}_1$ comes from a distribution $p(\mathbf{s}_1)$, and the agent receives a reward $r_t$ on each of the transitions. Standard RL aims to learn the parameters $\varphi$ of some policy $\pi_{\varphi}(\mathbf{a}_t\mid\mathbf{s}_t)$ such that the expected sum of rewards is maximized under the induced trajectory distribution $\rho_{\pi}$. This objective can be modified to incorporate an entropy term, such that the policy also aims to maximize the expected entropy $\mathscr{H}(\pi_{\varphi}(\cdot\mid\mathbf{s}_t))$ under the induced trajectory distribution $\rho_{\pi}$. This formulation has a close connection to variational inference (Ziebart, 2010; Haarnoja et al., 2018a; Levine, 2018), and we build on this in our work. The resulting maximum entropy objective is
$$\varphi^{*}=\arg\max_{\varphi}\sum_{t=1}^{T}\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim\rho_{\pi}}\left[r(\mathbf{s}_t,\mathbf{a}_t)+\alpha\,\mathscr{H}(\pi_{\varphi}(\cdot\mid\mathbf{s}_t))\right],$$  (1)
where $r$ is the reward function, and $\alpha$ is a temperature parameter that controls the tradeoff between optimizing for the reward and for the entropy (i.e., stochasticity) of the policy. Soft actor-critic (SAC) (Haarnoja et al., 2018a) uses this maximum entropy RL framework to derive soft policy iteration, which alternates between policy evaluation and policy improvement within the described maximum entropy framework. SAC then extends this soft policy iteration to handle continuous action spaces by using parameterized function approximators to represent both the Q-function $Q_{\theta}$ (critic) and the policy $\pi_{\varphi}$ (actor). The soft Q-function parameters $\theta$ are optimized to minimize the soft Bellman residual,
$$J_{Q}(\theta)=\mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t,r_t,\mathbf{s}_{t+1})\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}(\mathbf{s}_t,\mathbf{a}_t)-\left(r_t+\gamma V_{\overline{\theta}}(\mathbf{s}_{t+1})\right)\right)^{2}\right],$$  (2)
$$V_{\overline{\theta}}(\mathbf{s}_{t+1})=\mathbb{E}_{\mathbf{a}_{t+1}\sim\pi_{\varphi}}\left[Q_{\overline{\theta}}(\mathbf{s}_{t+1},\mathbf{a}_{t+1})-\alpha\log\pi_{\varphi}(\mathbf{a}_{t+1}\mid\mathbf{s}_{t+1})\right],$$  (3)
where $\mathcal{D}$ is the replay buffer, $\gamma$ is the discount factor, and $\overline{\theta}$ are delayed parameters. The policy parameters $\varphi$ are optimized to update the policy towards the exponential of the soft Q-function,
$$J_{\pi}(\varphi)=\mathbb{E}_{\mathbf{s}_t\sim\mathcal{D}}\left[\mathbb{E}_{\mathbf{a}_t\sim\pi_{\varphi}}\left[\alpha\log\pi_{\varphi}(\mathbf{a}_t\mid\mathbf{s}_t)-Q_{\theta}(\mathbf{s}_t,\mathbf{a}_t)\right]\right].$$  (4)
Results of this stochastic, entropy maximizing RL framework demonstrate improved robustness and stability. SAC also shows the sample efficiency benefits of an off-policy learning algorithm, in conjunction with the high performance benefits of a long-horizon planning algorithm. Precisely for these reasons, we choose to extend the SAC algorithm in this work to formulate our SLAC algorithm.
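As a rough illustration of how the soft Bellman residual in Equation (2), the soft value in Equation (3), and the policy objective in Equation (4) fit together, the following minimal sketch evaluates each quantity for a single transition. The function names and the toy numbers are hypothetical and chosen only for illustration; the expectations over actions are replaced by single-sample estimates, and no function approximators or gradients are involved.

```python
def soft_value(q_next, log_pi_next, alpha):
    """Single-sample estimate of Eq. (3): Q_target(s', a') - alpha * log pi(a'|s')."""
    return q_next - alpha * log_pi_next


def critic_loss(q, reward, v_next, gamma):
    """Per-transition soft Bellman residual of Eq. (2): 0.5 * (Q - (r + gamma * V))^2."""
    target = reward + gamma * v_next
    return 0.5 * (q - target) ** 2


def actor_loss(log_pi, q, alpha):
    """Per-sample policy objective of Eq. (4): alpha * log pi(a|s) - Q(s, a)."""
    return alpha * log_pi - q


# Toy numbers (hypothetical, for illustration only).
alpha, gamma = 0.2, 0.99
v_next = soft_value(q_next=1.0, log_pi_next=-0.5, alpha=alpha)
loss_q = critic_loss(q=2.0, reward=1.0, v_next=v_next, gamma=gamma)
loss_pi = actor_loss(log_pi=-0.5, q=2.0, alpha=alpha)
```

Note that a lower log-probability (a more stochastic policy) raises the soft value and lowers the actor loss, which is how the temperature $\alpha$ trades off reward against entropy.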
3.2 Sequential Latent Variable Models and Amortized Variational Inference in POMDPs
To learn representations for RL, we use latent variable models trained with amortized variational inference. The learned model must be able to process a large number of pixels that are present in the entangled image $\mathbf{x}$, and it must tease out the relevant information into a compact and disentangled representation $\mathbf{z}$. To learn such a model, we can consider maximizing the probability of each observed datapoint $\mathbf{x}$ from some training set $\mathcal{D}$ under the entire generative process $p(\mathbf{x})=\int p(\mathbf{x}\mid\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$. This objective is intractable to compute in general due to the marginalization of the latent variables $\mathbf{z}$. In amortized variational inference, we utilize the following bound on the log-likelihood (Kingma and Welling, 2014),
$$\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\log p(\mathbf{x})\right]\ge\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\mathbb{E}_{\mathbf{z}\sim q}\left[\log p(\mathbf{x}\mid\mathbf{z})\right]-\mathrm{D}_{\mathrm{KL}}\left(q(\mathbf{z}\mid\mathbf{x})\parallel p(\mathbf{z})\right)\right].$$  (5)
We can maximize the probability of the observed datapoints (i.e., the left hand side of Equation (5)) by learning an encoder $q(\mathbf{z}\mid\mathbf{x})$ and a decoder $p(\mathbf{x}\mid\mathbf{z})$, and then directly performing gradient ascent on the right hand side of the equation. In this setup, the distributions of interest are the prior $p(\mathbf{z})$, the observation model $p(\mathbf{x}\mid\mathbf{z})$, and the posterior $q(\mathbf{z}\mid\mathbf{x})$.
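To make the bound in Equation (5) concrete, the sketch below estimates its right hand side for a single datapoint with a diagonal Gaussian encoder and decoder and a standard normal prior. Everything here is an illustrative assumption rather than the model used in this work: the toy `encode` and `decode` functions stand in for neural networks, and they are chosen so that the encoder equals the exact posterior of a one-dimensional linear-Gaussian model, in which case the bound is tight in expectation and can be compared against the exact marginal log-likelihood.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def gaussian_log_prob(x, mean, std):
    """Log-density of x under a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * LOG_2PI
        for xi, m, s in zip(x, mean, std)
    )


def kl_to_standard_normal(mean, std):
    """Closed-form KL( N(mean, diag(std^2)) || N(0, I) )."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mean, std))


def elbo_estimate(x, encode, decode, rng, num_samples=1):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z)) for one datapoint."""
    mean, std = encode(x)
    recon = 0.0
    for _ in range(num_samples):
        # Reparameterized sample z ~ q(z|x).
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        x_mean, x_std = decode(z)
        recon += gaussian_log_prob(x, x_mean, x_std)
    return recon / num_samples - kl_to_standard_normal(mean, std)


# Toy model (assumption): p(z) = N(0, 1) and p(x|z) = N(z, 1),
# so the exact posterior is N(x/2, 1/2) and the marginal is p(x) = N(0, 2).
def encode(x):
    return [0.5 * xi for xi in x], [math.sqrt(0.5)] * len(x)


def decode(z):
    return list(z), [1.0] * len(z)


estimate = elbo_estimate([1.0], encode, decode, random.Random(0), num_samples=20000)
exact_log_px = gaussian_log_prob([1.0], [0.0], [math.sqrt(2.0)])
```

With a less accurate encoder, the estimate would fall below `exact_log_px` by the KL divergence between the encoder and the true posterior, which is the gap that training the encoder reduces.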
Although such generative models have been shown to successfully model various types of complex distributions (Kingma and Welling, 2014) by embedding knowledge of the distribution into an informative latent space, they do not have a built-in mechanism for the use of temporal information when performing inference. In the case of partially observable environments, as we discuss below, the representative latent state $\mathbf{z}_t$ corresponding to a given non-Markovian observation $\mathbf{x}_t$ needs to be informed by past observations.
Consider a partially observable MDP (POMDP), where an action $\mathbf{a}_t\in\mathcal{A}$ from latent state $\mathbf{z}_t\in\mathcal{Z}$ results in latent state $\mathbf{z}_{t+1}\in\mathcal{Z}$ and emits a corresponding observation $\mathbf{x}_{t+1}\in\mathcal{X}$. We make an explicit distinction between an observation $\mathbf{x}_t$ and the underlying latent state $\mathbf{z}_t$, to emphasize that the latter is unobserved and the distribution is not known a priori. Analogous to the fully observable MDP, the initial state distribution is $p(\mathbf{z}_1)$, the transition probability distribution is $p(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t)$, and the reward is $r_t$. In addition, the observation model is given by $p(\mathbf{x}_t\mid\mathbf{z}_t)$.
As in the case for VAEs, a generative model of these observations $\mathbf{x}_t$ can be learned by maximizing the log-likelihood. In the POMDP setting, however, we note that $\mathbf{x}_t$ alone does not provide all necessary information to infer $\mathbf{z}_t$, and thus, prior temporal information must be taken into account. This brings us to the discussion of sequential latent variable models. The distributions of interest are the priors $p(\mathbf{z}_1)$ and $p(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t)$, the observation model $p(\mathbf{x}_t\mid\mathbf{z}_t)$, and the approximate posteriors $q(\mathbf{z}_1\mid\mathbf{x}_1)$ and $q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)$. The log-likelihood of the observations can then be bounded, similarly to the VAE bound in Equation (5), as
$$\begin{aligned}\log p(\mathbf{x}_{1:\tau+1}\mid\mathbf{a}_{1:\tau})\ge\mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q}\Big[&\sum_{t=1}^{\tau+1}\log p(\mathbf{x}_t\mid\mathbf{z}_t)-\mathrm{D}_{\mathrm{KL}}\left(q(\mathbf{z}_1\mid\mathbf{x}_1)\parallel p(\mathbf{z}_1)\right)\\&-\sum_{t=1}^{\tau}\mathrm{D}_{\mathrm{KL}}\left(q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\parallel p(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t)\right)\Big].\end{aligned}$$  (6)
Prior work (Hafner et al., 2019; Buesing et al., 2018; Doerr et al., 2018) has explored modeling such non-Markovian observation sequences, using methods such as recurrent neural networks with deterministic hidden state, as well as probabilistic state-space models. In this work, we enable the effective training of a fully stochastic sequential latent variable model, and bring it together with a maximum entropy actor-critic RL algorithm to create SLAC: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs.
4 Joint Modeling and Control as Inference
Our method aims to learn maximum entropy policies from high-dimensional, non-Markovian observations in a POMDP, while also learning a model of that POMDP. The model alleviates the representation learning problem, which in turn helps with the policy learning problem. We formulate the control problem as inference in a probabilistic graphical model with latent variables, as shown in Figure 1.
For a fully observable MDP, the control problem can be embedded into a graphical model by introducing a binary random variable $\mathcal{O}_t$, which indicates if time step $t$ is optimal. When its distribution is chosen to be $p(\mathcal{O}_t=1\mid\mathbf{s}_t,\mathbf{a}_t)\propto\exp(r(\mathbf{s}_t,\mathbf{a}_t))$, then maximization of $p(\mathcal{O}_{1:T})$ via approximate inference in that model yields the optimal policy for the maximum entropy objective (Levine, 2018).
In a POMDP setting, the distribution can analogously be given by $p(\mathcal{O}_t=1\mid\mathbf{z}_t,\mathbf{a}_t)\propto\exp(r(\mathbf{z}_t,\mathbf{a}_t))$. Instead of maximizing the likelihood of the optimality variables alone, we jointly model the observations (including the observed rewards of the past time steps) and learn maximum entropy policies by maximizing the marginal likelihood $p(\mathbf{x}_{1:\tau+1},\mathcal{O}_{\tau+1:T}\mid\mathbf{a}_{1:\tau})$. This objective represents both the likelihood of the observed data from the past $\tau+1$ steps, as well as the optimality of the agent’s actions for future steps. We factorize our variational distribution into a product of recognition terms $q(\mathbf{z}_1\mid\mathbf{x}_1)$ and $q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)$, and a policy term $\pi(\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})$ that is independent of $\mathbf{z}$:
$$q(\mathbf{z}_{1:\tau+1},\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})=q(\mathbf{z}_1\mid\mathbf{x}_1)\prod_{t=1}^{\tau}q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\,\pi(\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau}).$$  (7)
The posterior over the actions represents the agent’s policy $\pi$. We make a design choice in conditioning this policy directly on observations and actions, instead of the latent state, because this approximation allows us to directly execute the policy without having to perform inference on the latent belief state at run time. We use the posterior from Equation (7) and maximize the evidence lower bound (ELBO) of the marginal likelihood,
$$\begin{aligned}\log p(\mathbf{x}_{1:\tau+1},\mathcal{O}_{\tau+1:T}\mid\mathbf{a}_{1:\tau})\ge\mathbb{E}_{(\mathbf{z}_{1:\tau+1},\mathbf{a}_{\tau+1})\sim q}\Big[&\sum_{t=1}^{\tau+1}\log p(\mathbf{x}_t\mid\mathbf{z}_t)\\&-\mathrm{D}_{\mathrm{KL}}\left(q(\mathbf{z}_1\mid\mathbf{x}_1)\parallel p(\mathbf{z}_1)\right)-\sum_{t=1}^{\tau}\mathrm{D}_{\mathrm{KL}}\left(q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\parallel p(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t)\right)\\&+r(\mathbf{z}_{\tau+1},\mathbf{a}_{\tau+1})-\log\pi(\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})+V(\mathbf{z}_{\tau+1})\Big].\end{aligned}$$  (8)
This derivation assumes that the reward function, which determines $p(\mathcal{O}_t\mid\mathbf{z}_t,\mathbf{a}_t)$, is known. However, in many RL problems, this is not the case. In that situation, we can simply append the reward to the observation, and learn the reward along with $p(\mathbf{x}\mid\mathbf{z})$. This requires no modification to our method other than changing the observation space, and we use this approach in all of our experiments.
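One way to picture the reward-appending step described above is the small sketch below. It assumes, purely for illustration, a flattened image, a decoder with independent Gaussian outputs, and one extra output dimension that predicts the reward; the names `augment_observation`, `pred_means`, and `pred_stds` are hypothetical. Under that factorization assumption, the log-likelihood of the augmented observation is simply the image term plus a reward term, so the reward is learned by the same objective as the observation model.

```python
import math


def gaussian_log_prob(values, means, stds):
    """Sum of independent univariate Gaussian log-densities."""
    return sum(
        -0.5 * ((v - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for v, m, s in zip(values, means, stds)
    )


def augment_observation(pixels, reward):
    """Append the scalar reward to a flattened image observation."""
    return list(pixels) + [float(reward)]


# Hypothetical decoder output for one latent sample: per-pixel means and
# standard deviations, plus one extra (mean, std) pair predicting the reward.
pixels, reward = [0.2, 0.8, 0.5], 1.5
pred_means, pred_stds = [0.25, 0.7, 0.5, 1.0], [0.1, 0.1, 0.1, 0.5]

x_aug = augment_observation(pixels, reward)
log_lik = gaussian_log_prob(x_aug, pred_means, pred_stds)

# With a factorized decoder, the augmented likelihood splits into two terms.
image_term = gaussian_log_prob(pixels, pred_means[:3], pred_stds[:3])
reward_term = gaussian_log_prob([reward], pred_means[3:], pred_stds[3:])
```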
5 Stochastic Latent Actor Critic
We now describe our stochastic latent actor-critic (SLAC) algorithm, which approximately maximizes the ELBO of Equation (8) using function approximators to model the prior and posterior distributions.
Latent Variable Model: The first part of the ELBO corresponds to training the latent variable model to maximize the likelihood of the observations, analogous to the ELBO in Equation (6) for the sequential latent variable model. The distributions of the latent variable model are diagonal Gaussian distributions, where the means and variances are outputs of neural networks. The distribution parameters $\psi$ of this model are optimized to maximize the first part of the ELBO,
$$\begin{aligned}J_{M}(\psi)=\mathbb{E}_{(\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau},r_{1:\tau})\sim\mathcal{D}}\Big[\mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_{\psi}}\Big[&\sum_{t=1}^{\tau+1}\log p_{\psi}(\mathbf{x}_t\mid\mathbf{z}_t)-\mathrm{D}_{\mathrm{KL}}\left(q_{\psi}(\mathbf{z}_1\mid\mathbf{x}_1)\parallel p_{\psi}(\mathbf{z}_1)\right)\\&-\sum_{t=1}^{\tau}\mathrm{D}_{\mathrm{KL}}\left(q_{\psi}(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\parallel p_{\psi}(\mathbf{z}_{t+1}\mid\mathbf{z}_t,\mathbf{a}_t)\right)\Big]\Big].\end{aligned}$$  (9)
We use the reparameterization trick to sample from the filtering distribution $q_{\psi}(\mathbf{z}_{1:\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})$.
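The sampling step above can be sketched as follows for a scalar latent. This is only an illustration under stated assumptions: `first_posterior` and `step_posterior` are hypothetical stand-ins for the neural networks that output the mean and standard deviation of $q(\mathbf{z}_1\mid\mathbf{x}_1)$ and $q(\mathbf{z}_{t+1}\mid\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)$, and the coefficients inside them are arbitrary. The point is the structure: each sample is written as a deterministic function of the distribution parameters and independent noise, and each step conditions on the previously sampled latent.

```python
import random


def reparameterized_sample(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1).

    Writing the sample this way keeps z a deterministic function of (mean, std),
    which is what lets gradients flow through the sample in an autodiff framework.
    """
    return mean + std * rng.gauss(0.0, 1.0)


def first_posterior(x1):
    """Hypothetical q(z_1 | x_1): returns (mean, std)."""
    return 0.5 * x1, 1.0


def step_posterior(x_next, z, a):
    """Hypothetical q(z_{t+1} | x_{t+1}, z_t, a_t): returns (mean, std)."""
    return 0.5 * x_next + 0.3 * z + 0.2 * a, 0.5


def sample_filtering(xs, actions, rng):
    """Sample z_{1:tau+1} from q(z_1|x_1) * prod_t q(z_{t+1}|x_{t+1}, z_t, a_t)."""
    assert len(xs) == len(actions) + 1
    mean, std = first_posterior(xs[0])
    zs = [reparameterized_sample(mean, std, rng)]
    for x_next, a in zip(xs[1:], actions):
        mean, std = step_posterior(x_next, zs[-1], a)
        zs.append(reparameterized_sample(mean, std, rng))
    return zs


rng = random.Random(0)
zs = sample_filtering(xs=[1.0, 0.5, -0.2], actions=[0.1, -0.1], rng=rng)
```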
Critic and Actor: The second part of the ELBO corresponds to the maximum entropy RL objective. As in the fully observable case from Section 3.1 and as described by Levine (2018), this optimization can be solved via message passing of soft Q-values, except that we use the latent states $\mathbf{z}$ rather than the standard states $\mathbf{s}$. For continuous state and action spaces, this message passing is approximated by minimizing the soft Bellman residual, which we use to train our soft Q-function parameters $\theta$,
$$\begin{aligned}J_{Q}(\theta)=\mathbb{E}_{(\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau},r_{\tau})\sim\mathcal{D}}\Big[\mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_{\psi}}\Big[&\frac{1}{2}\Big(Q_{\theta}(\mathbf{z}_{\tau},\mathbf{a}_{\tau})\\&-\Big(r_{\tau}+\gamma\,\mathbb{E}_{\mathbf{a}_{\tau+1}\sim\pi_{\varphi}}\big[Q_{\overline{\theta}}(\mathbf{z}_{\tau+1},\mathbf{a}_{\tau+1})-\alpha\log\pi_{\varphi}(\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})\big]\Big)\Big)^{2}\Big]\Big],\end{aligned}$$  (10)
where $\overline{\theta}$ are delayed parameters, obtained as exponential moving averages of $\theta$. Notice that the latents $\mathbf{z}_{\tau}$ and $\mathbf{z}_{\tau+1}$, which are used in the Bellman backup, are sampled from the same joint, i.e., $\mathbf{z}_{\tau+1}\sim q_{\psi}(\mathbf{z}_{\tau+1}\mid\mathbf{x}_{\tau+1},\mathbf{z}_{\tau},\mathbf{a}_{\tau})$. Next, the policy parameters $\varphi$ are optimized to update the policy towards the exponential of that soft Q-function, analogously to soft actor-critic (Haarnoja et al., 2018a) as shown in Equation (4):
$$J_{\pi}(\varphi)=\mathbb{E}_{(\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})\sim\mathcal{D}}\left[\mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_{\psi}}\left[\mathbb{E}_{\mathbf{a}_{\tau+1}\sim\pi_{\varphi}}\left[\alpha\log\pi_{\varphi}(\mathbf{a}_{\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})-Q_{\theta}(\mathbf{z}_{\tau+1},\mathbf{a}_{\tau+1})\right]\right]\right].$$  (11)
We again use the reparameterization trick to sample from the policy, and the policy loss only uses the last sample $\mathbf{z}_{\tau+1}$ of the sequence. Finally, we note that for the expectation over latent states in the Bellman residual in Equation (10), rather than sampling latent states $\mathbf{z}\sim\mathcal{Z}$, we sample latent states from the filtering distribution $q_{\psi}(\mathbf{z}_{1:\tau+1}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})$. This design choice allows us to minimize the critic loss for samples that are most relevant for $Q$, while also allowing the critic loss to use the Q-function in the same way as implied by the policy loss in Equation (11).
SLAC is outlined in Algorithm 1. The actor-critic component follows prior work, with automatic tuning of the temperature $\alpha$ and two Q-functions to mitigate overestimation (Fujimoto et al., 2018; Haarnoja et al., 2018a, b). SLAC can be viewed as a variant of SAC (Haarnoja et al., 2018a) where the critic is trained on the stochastic latent state of our sequential latent variable model. The backup for the critic is performed on a tuple $(\mathbf{z}_{\tau},\mathbf{a}_{\tau},r_{\tau},\mathbf{z}_{\tau+1})$, sampled from the posterior $q(\mathbf{z}_{\tau+1},\mathbf{z}_{\tau}\mid\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})$. The critic can, in principle, take advantage of the perfect knowledge of the state $\mathbf{z}_t$, which makes learning easier. However, the actor does not have access to $\mathbf{z}_t$, and must make decisions based on a history of observations and actions. SLAC is not a model-based algorithm, in that it does not use the model for prediction, but we see in our experiments that SLAC has better sample efficiency than prior model-based algorithms.
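A minimal sketch of the critic backup on a latent tuple, with two Q-functions, might look as follows. The inputs are plain numbers standing in for network outputs, and the function names are hypothetical. Taking the minimum of the two target Q-values when forming the backup is the convention of the cited prior work (Fujimoto et al., 2018; Haarnoja et al., 2018a) and is assumed here rather than spelled out in the text above.

```python
def latent_critic_target(r, q1_next, q2_next, log_pi_next, alpha, gamma):
    """Backup target r_tau + gamma * (min_i Q_target_i(z', a') - alpha * log pi(a' | history)).

    q1_next and q2_next are the two target critics evaluated at the sampled
    latent z_{tau+1}; log_pi_next comes from the actor, which conditions on the
    observation-action history rather than on the latent.
    """
    return r + gamma * (min(q1_next, q2_next) - alpha * log_pi_next)


def latent_critic_loss(q1, q2, target):
    """Sum of the two soft Bellman residuals, each regressed onto the shared target."""
    return 0.5 * (q1 - target) ** 2 + 0.5 * (q2 - target) ** 2


# Toy numbers (hypothetical): values for one sampled tuple (z_tau, a_tau, r_tau, z_{tau+1}).
target = latent_critic_target(
    r=1.0, q1_next=2.0, q2_next=1.5, log_pi_next=-1.0, alpha=0.1, gamma=0.9
)
loss = latent_critic_loss(q1=2.44, q2=2.0, target=target)
```

Using the smaller of the two target values makes the backup pessimistic, which counteracts the tendency of a single learned Q-function to overestimate.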
Algorithm 1: Stochastic Latent Actor-Critic (SLAC)
Require: $E$, $\psi$, $\theta_1$, $\theta_2$, $\varphi$ ▷ Environment and initial parameters for model, actor, and critic
$\mathbf{x}_1\sim E_{\text{reset}}()$ ▷ Sample initial observation from the environment
$\mathcal{D}\leftarrow(\mathbf{x}_1)$ ▷ Initialize replay buffer with initial observation
for each iteration do
    for each environment step do
        $\mathbf{a}_t\sim\pi_{\varphi}(\mathbf{a}_t\mid\mathbf{x}_{1:t},\mathbf{a}_{1:t-1})$ ▷ Sample action from the policy
        $r_t,\mathbf{x}_{t+1}\sim E_{\text{step}}(\mathbf{a}_t)$ ▷ Sample transition from the environment
        $\mathcal{D}\leftarrow\mathcal{D}\cup(\mathbf{a}_t,r_t,\mathbf{x}_{t+1})$ ▷ Store the transition in the replay buffer
    end for
    for each gradient step do
        $\psi\leftarrow\psi+\lambda_{M}\nabla_{\psi}J_{M}(\psi)$ ▷ Update model weights (ascent, since $J_M$ is maximized)
        $\theta_i\leftarrow\theta_i-\lambda_{Q}\nabla_{\theta_i}J_{Q}(\theta_i)$ for $i\in\{1,2\}$ ▷ Update the Q-function weights
        $\varphi\leftarrow\varphi-\lambda_{\pi}\nabla_{\varphi}J_{\pi}(\varphi)$ ▷ Update policy weights
        $\overline{\theta}_i\leftarrow\nu\theta_i+(1-\nu)\overline{\theta}_i$ for $i\in\{1,2\}$ ▷ Update target critic network weights
    end for
end for
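The last update in the gradient-step loop, which moves the target critic weights towards the current critic weights, is an exponential moving average. A minimal sketch on plain lists of numbers is given below; the function name `update_target` and the value of `nu` are illustrative assumptions, and a real implementation would apply the same rule to every tensor of each critic network.

```python
def update_target(target_params, params, nu):
    """Exponential moving average: target <- nu * params + (1 - nu) * target."""
    return [nu * p + (1.0 - nu) * tp for p, tp in zip(params, target_params)]


# Toy usage: with fixed critic weights, repeated updates pull the target towards them.
params = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(200):
    target = update_target(target, params, nu=0.05)
```

A small `nu` makes the target change slowly, which keeps the regression target in the critic loss from shifting too quickly between gradient steps.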
6 Latent Variable Model
We briefly summarize our full model architecture here, with full details in Appendix A. The architecture in our implementation factorizes the latent variable ${\mathbf{z}}_{t}$ into two stochastic layers, ${\mathbf{z}}_{t}^{1}$ and ${\mathbf{z}}_{t}^{2}$, as shown in Figure 2. We found this design to produce high quality reconstructions and samples, and utilize it in all of our experiments. The generative model $p$ and the inference model $q$ are given by
${p}_{\psi}({\mathbf{z}}_{1})={p}_{\psi}({\mathbf{z}}_{1}^{2}\mid {\mathbf{z}}_{1}^{1})\,p({\mathbf{z}}_{1}^{1}),$
${p}_{\psi}({\mathbf{z}}_{t+1}\mid {\mathbf{z}}_{t},{\mathbf{a}}_{t})={p}_{\psi}({\mathbf{z}}_{t+1}^{2}\mid {\mathbf{z}}_{t+1}^{1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})\,{p}_{\psi}({\mathbf{z}}_{t+1}^{1}\mid {\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t}),$
${q}_{\psi}({\mathbf{z}}_{1}\mid {\mathbf{x}}_{1})={p}_{\psi}({\mathbf{z}}_{1}^{2}\mid {\mathbf{z}}_{1}^{1})\,{q}_{\psi}({\mathbf{z}}_{1}^{1}\mid {\mathbf{x}}_{1}),$
${q}_{\psi}({\mathbf{z}}_{t+1}\mid {\mathbf{x}}_{t+1},{\mathbf{z}}_{t},{\mathbf{a}}_{t})={p}_{\psi}({\mathbf{z}}_{t+1}^{2}\mid {\mathbf{z}}_{t+1}^{1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})\,{q}_{\psi}({\mathbf{z}}_{t+1}^{1}\mid {\mathbf{x}}_{t+1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t}).$
Note that we choose the variational distribution $q$ over ${\mathbf{z}}_{t}^{2}$ to be the same as the model $p$. Thus, the KL divergence in ${J}_{M}$ simplifies to the divergence between $q$ and $p$ over ${\mathbf{z}}_{t}^{1}$. We use a multivariate standard normal distribution for $p({\mathbf{z}}_{1}^{1})$, since it is not conditioned on any variables, i.e. ${\mathbf{z}}_{1}^{1}\sim \mathcal{N}(\mathbf{0},\mathbf{I})$. The conditional distributions of our model are diagonal Gaussian, with means and variances given by neural networks. Unlike models from prior work (Hafner et al., 2019; Buesing et al., 2018; Doerr et al., 2018), which have deterministic and stochastic paths and use recurrent neural networks, ours is fully stochastic, i.e. our latent state is a Markovian latent random variable formed by the concatenation of ${\mathbf{z}}_{t}^{1}$ and ${\mathbf{z}}_{t}^{2}$. Further details are discussed in Appendix A.
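Because the conditionals are diagonal Gaussians, the remaining KL term over ${\mathbf{z}}_{t}^{1}$ has a closed form. The sketch below evaluates it for one time step using the standard formula for two diagonal Gaussians. The function name and the plain-list representation of means and standard deviations are assumptions made for illustration.

```python
import math


def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Per dimension: log(std_p / std_q)
                   + (std_q^2 + (mu_q - mu_p)^2) / (2 * std_p^2) - 1/2.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += (
            math.log(sp / sq)
            + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
            - 0.5
        )
    return total
```

For the first time step, `mu_p` and `std_p` would be zeros and ones, matching the standard normal prior over ${\mathbf{z}}_{1}^{1}$.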
7 Experimental Evaluation
We evaluate SLAC on numerous image-based continuous control tasks from both the DeepMind Control Suite (Tassa et al., 2018) and OpenAI Gym (Brockman et al., 2016), as illustrated in Figure 3. We also evaluate on multiple simulated manipulation tasks with two different robotic manipulators. Full details of SLAC’s network architecture are described in Appendix A. Aside from the value of action repeats (i.e. control frequency) for the tasks, we kept all of SLAC’s hyperparameters constant across all tasks in all domains. Training and evaluation details are given in Appendix B, and image samples from our model for all tasks are shown in Appendix D. Additionally, visualizations of our results and code are available on the project website (https://alexleegk.github.io/slac/).
7.1 Comparative Evaluation on Continuous Control Benchmark Tasks
To provide a comparative evaluation against prior methods, we evaluate SLAC on four tasks (cheetah run, walker walk, ball-in-cup catch, finger spin) from the DeepMind Control Suite (Tassa et al., 2018), and four tasks (cheetah, walker, ant, hopper) from OpenAI Gym (Brockman et al., 2016). Note that the Gym tasks are typically used with low-dimensional state observations, while we evaluate on them with raw image observations. We compare our method to the following state-of-the-art model-based and model-free algorithms:
SAC (Haarnoja et al., 2018a): This is an off-policy actor-critic algorithm, which represents a comparison to state-of-the-art model-free learning. We include experiments showing the performance of SAC based on true state (as an upper bound on performance) as well as directly from raw images.
D4PG (Barth-Maron et al., 2018): This is also an off-policy actor-critic algorithm, learning directly from raw images. The results reported in the plots below are the performance after ${10}^{8}$ training steps, as stated in the benchmarks of Tassa et al. (2018).
PlaNet (Hafner et al., 2019): This is a model-based RL method for learning from images, which uses a mixed deterministic/stochastic sequential latent variable model, but without explicit policy learning. Instead, the model is used for planning with model predictive control (MPC), where each plan is optimized with the cross-entropy method (CEM).
DVRL (Igl et al., 2018): This is an on-policy model-free RL algorithm that also trains a mixed deterministic/stochastic latent-variable POMDP model. DVRL uses the full belief over the latent state as input into both the actor and critic, as opposed to our method, which trains the critic with the latent state and the actor with a history of actions and observations.
Our experiments on the DeepMind Control Suite in Figure 4 show that the sample efficiency of SLAC is comparable to or better than that of both model-based and model-free alternatives. This indicates that overcoming the representation learning bottleneck, coupled with efficient off-policy RL, provides for fast learning similar to model-based methods, while attaining final performance comparable to fully model-free techniques that learn from state. SLAC also substantially outperforms DVRL. This difference can be explained in part by the use of an efficient off-policy RL algorithm, which can better take advantage of the learned representation.
We also evaluate SLAC on continuous control benchmark tasks from OpenAI Gym in Figure 5. We notice that these tasks are much more challenging than the DeepMind Control Suite tasks, because the rewards are not as shaped and not bounded between 0 and 1, the dynamics are different, and the episodes terminate on failure (e.g., when the hopper or walker falls over). PlaNet is unable to solve the last three tasks, while for the cheetah task, it learns a suboptimal policy that involves flipping the cheetah over and pushing forward while on its back. To better understand the performance of fixed-horizon MPC on these tasks, we also evaluated with the ground truth dynamics (i.e., the true simulator), and found that even in this case, MPC did not achieve good final performance, suggesting that infinite-horizon policy optimization, of the sort performed by SLAC and model-free algorithms, is important to attain good results on these tasks.
For the plots in Figure 4 and Figure 5, we show the mean and standard deviation over 2 random seeds with 10 evaluation trajectories per seed. We compare all methods using a fixed action repeat per task, except for the reported D4PG numbers from Tassa et al. (2018), which we could not rerun ourselves due to lack of open-source code. Due to action repeats, the number of samples used for training is only a fraction of the environment steps reported in our plots. For example, an episode with $1000$ environment steps at an action repeat of $4$ only receives $250$ observations. See Appendix B for details.
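The sample accounting under action repeats is simple integer arithmetic. A minimal sketch (the function name is hypothetical):

```python
def agent_observations(env_steps: int, action_repeat: int) -> int:
    """Number of observations the agent receives when each action is
    held for `action_repeat` consecutive environment steps."""
    if action_repeat < 1:
        raise ValueError("action_repeat must be at least 1")
    return env_steps // action_repeat


print(agent_observations(1000, 4))  # 250, the example given in the text
```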
Our experiments show that SLAC successfully learns complex continuous control benchmark tasks from raw image inputs, while demonstrating both high sample efficiency and high task performance through the use of explicit representation learning in conjunction with a robust maximum-entropy actor-critic RL algorithm.
7.2 Simulated Robotic Manipulation Tasks
Beyond standard benchmark tasks, we also aim to illustrate the flexibility of our method by demonstrating it on a variety of image-based robotic manipulation skills. The reward functions for these tasks can be found in Appendix C, and videos of these behaviors can be found on the supplementary website (https://alexleegk.github.io/slac/). In Figure 6, we illustrate SLAC’s execution of these manipulation tasks, which use a simulated Sawyer robotic arm to push open a door, close a drawer, and reach out and pick up an object. Our method is able to learn these contact-rich manipulation tasks from raw images, succeeding even when the object of interest occupies only a small portion of the image.
In our next set of manipulation experiments, we use the 9-DoF 3-fingered DClaw robot to rotate a valve (Zhu et al., 2019) from various starting positions to various desired goal locations, where the goal is illustrated as a green dot in the image. In all of our experiments, we allow the starting position of the valve to be selected randomly from $[-\pi ,\pi ]$, and we test three different settings for the goal location (see Figure 7). First, we prescribe the goal position to always be fixed. In this task setting, we see that SLAC, SAC from images, and SAC from state all perform similarly in terms of both sample efficiency and final performance. However, when we allow the goal location to be selected randomly from a set of 3 options $\{-\frac{\pi}{2},0,\frac{\pi}{2}\}$, we see that SLAC and SAC from images actually outperform SAC from state. This interesting result can perhaps be explained by the fact that, when learning from state, the goal is specified with just a single number within the state vector, rather than with the redundancy of numerous green pixels in the image. Finally, when we allow the goal position of the valve to be selected randomly from $[-\frac{\pi}{2},\frac{\pi}{2}]$, we see that SLAC’s explicit representation learning improves substantially over image-based SAC, performing comparably to the oracle baseline that receives the true state observation.
7.3 Evaluating the Latent Variable Model
We next study the tradeoffs between different design choices for the latent variable model. We compare our fully stochastic model, as described in Section 6, to a standard non-sequential VAE model (Kingma and Welling, 2014), which has been used in multiple prior works for representation learning in RL (Higgins et al., 2017; Ha and Schmidhuber, 2018; Nair et al., 2018), the mixed deterministic/stochastic model used by PlaNet (Hafner et al., 2019), as well as three variants of our model: a simple filtering model that does not factorize the latent variable into two layers of stochastic units, a fully deterministic model that removes all stochasticity from the hidden state dynamics, and a mixed model that has both deterministic and stochastic transitions, similar to the PlaNet model, but with our architecture. In all cases, we use the RL framework of SLAC and only vary the choice of model for representation learning. As shown in the comparison in Figure 8, our fully stochastic model substantially outperforms prior models as well as the deterministic and simple variants of our own model. Although even the naïve VAE model provides for fast learning in the beginning, the performance of these models quickly saturates. The mixed deterministic/stochastic variant of our model nearly matches the performance of the final fully stochastic variant but, contrary to the conclusions in prior work (Hafner et al., 2019; Buesing et al., 2018), the fully stochastic model performs on par or better, while retaining the appealing interpretation of a stochastic state space belief model.
7.4 Qualitative Predictions from the Latent Variable Model
We show example image samples from our learned sequential latent variable model for the cheetah task in Figure 9, and we include the other tasks in the appendix. Samples from the posterior show the images ${\mathbf{x}}_{t}$ as constructed by the decoder ${p}_{\psi}({\mathbf{x}}_{t}\mid {\mathbf{z}}_{t})$, using a sequence of latents ${\mathbf{z}}_{t}$ that are encoded and sampled from the posteriors, ${q}_{\psi}({\mathbf{z}}_{1}\mid {\mathbf{x}}_{1})$ and ${q}_{\psi}({\mathbf{z}}_{t+1}\mid {\mathbf{x}}_{t+1},{\mathbf{z}}_{t},{\mathbf{a}}_{t})$. Samples from the prior, on the other hand, use a sequence of latents where ${\mathbf{z}}_{1}$ is sampled from $p({\mathbf{z}}_{1})$ and all remaining latents ${\mathbf{z}}_{t}$ are from the propagation of the previous latent state through the latent dynamics ${p}_{\psi}({\mathbf{z}}_{t+1}\mid {\mathbf{z}}_{t},{\mathbf{a}}_{t})$. Note that these prior samples do not use any image frames as inputs, and thus they do not correspond to any ground truth sequence. We also show samples from the conditional prior, which is conditioned on the first image from the true sequence: for this, the sampling procedure is the same as the prior, except that ${\mathbf{z}}_{1}$ is encoded and sampled from the posterior ${q}_{\psi}({\mathbf{z}}_{1}\mid {\mathbf{x}}_{1})$, rather than being sampled from $p({\mathbf{z}}_{1})$. We notice that the generated image samples can be made sharper and more realistic by using a smaller variance for ${p}_{\psi}({\mathbf{x}}_{t}\mid {\mathbf{z}}_{t})$ when training the model, but at the expense of a representation that leads to lower returns. Finally, note that we do not actually use the samples from the prior for training.
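The three sampling procedures differ only in which distribution produces each latent. The sketch below makes that explicit. It assumes the model is supplied as a dictionary of four sampling callables; the function name, the key names, and this interface are hypothetical and chosen only for illustration.

```python
def sample_latent_sequence(mode, images, actions, model):
    """Return the latents z_1, ..., z_T used to decode a sequence of images.

    mode: "posterior", "prior", or "conditional_prior".
    images: observations x_1, ..., x_T (not accessed when mode == "prior").
    actions: actions a_1, ..., a_{T-1}.
    model: dict of callables (hypothetical interface):
        "prior_first"()              samples p(z_1)
        "posterior_first"(x_1)       samples q(z_1 | x_1)
        "prior_next"(z, a)           samples p(z_{t+1} | z_t, a_t)
        "posterior_next"(x, z, a)    samples q(z_{t+1} | x_{t+1}, z_t, a_t)
    """
    if mode not in ("posterior", "prior", "conditional_prior"):
        raise ValueError("unknown mode: " + str(mode))

    # Only the pure prior ignores the first image.
    if mode == "prior":
        z = model["prior_first"]()
    else:
        z = model["posterior_first"](images[0])

    latents = [z]
    for t, action in enumerate(actions):
        if mode == "posterior":
            # Filtering: every step is corrected by the next image.
            z = model["posterior_next"](images[t + 1], z, action)
        else:
            # Prediction: propagate through the latent dynamics only.
            z = model["prior_next"](z, action)
        latents.append(z)
    return latents
```

Decoding each returned latent with the observation model would give the corresponding row of image samples.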
[Figure 9: Cheetah run. Image rows: ground truth, posterior samples, conditional prior samples, and prior samples.]
8 Discussion and Future Work
We presented SLAC, an efficient RL algorithm for learning from high-dimensional image inputs that combines efficient off-policy model-free RL with representation learning via a sequential stochastic state space model. Through representation learning in conjunction with effective task learning in the learned latent space, our method achieves improved sample efficiency and final task performance as compared to both prior model-based and model-free RL methods.
While our current SLAC algorithm is fully model-free, in that predictions from the model are not utilized to speed up training, a natural extension of our approach would be to use the model predictions themselves to generate synthetic samples. Incorporating this additional synthetic model-based data into a mixed model-based/model-free method could further improve sample efficiency and performance. More broadly, the use of explicit representation learning with RL has the potential to not only accelerate training and increase the complexity of achievable tasks, but also enable reuse and transfer of our learned representation across tasks.
Acknowledgments
We thank Marvin Zhang, Abhishek Gupta, and Chelsea Finn for useful discussions and feedback, Danijar Hafner for providing timely assistance with PlaNet, and Maximilian Igl for providing timely assistance with DVRL. We also thank Deirdre Quillen, Tianhe Yu, and Chelsea Finn for providing us with their suite of Sawyer manipulation tasks. This research was supported by the National Science Foundation through IIS-1651843 and IIS-1700697, as well as ARL DCIST CRA W911NF-17-2-0181 and the Office of Naval Research. Compute support was provided by NVIDIA.
References
 Astrom [1965] K. J. Astrom. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 1965.
 Barth-Maron et al. [2018] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations (ICLR), 2018.
 Brockman et al. [2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Buesing et al. [2018] L. Buesing, T. Weber, S. Racanière, S. M. A. Eslami, D. J. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, and D. Wierstra. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
 Cho et al. [2014] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
 Chua et al. [2018] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Neural Information Processing Systems (NIPS), 2018.
 Deisenroth and Rasmussen [2011] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.
 Doerr et al. [2018] A. Doerr, C. Daniel, M. Schiegg, D. Nguyen-Tuong, S. Schaal, M. Toussaint, and S. Trimpe. Probabilistic recurrent state-space models. In International Conference on Machine Learning (ICML), 2018.
 Finn and Levine [2017] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In International Conference on Robotics and Automation (ICRA), 2017.
 Finn et al. [2016] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In International Conference on Robotics and Automation (ICRA), 2016.
 Fujimoto et al. [2018] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), 2018.
 Gu et al. [2016] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning (ICML), 2016.
 Ha and Schmidhuber [2018] D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
 Haarnoja et al. [2018a] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018a.
 Haarnoja et al. [2018b] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
 Hafner et al. [2019] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), 2019.
 Hausknecht and Stone [2015] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents, 2015.
 Higgins et al. [2017] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning (ICML), 2017.
 Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
 Igl et al. [2018] M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson. Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning (ICML), 2018.
 Kaelbling et al. [1998] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
 Karl et al. [2017] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. In International Conference on Learning Representations (ICLR), 2017.
 Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
 Lange and Riedmiller [2010] S. Lange and M. Riedmiller. Deep autoencoder neural networks in reinforcement learning. In International Joint Conference on Neural Networks (IJCNN), 2010.
 Levine [2018] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
 Nagabandi et al. [2018] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation (ICRA), 2018.
 Nair et al. [2018] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. In Neural Information Processing Systems (NIPS), 2018.
 Shelhamer et al. [2016] E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
 Sutton [1991] R. S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
 Tassa et al. [2018] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
 Watter et al. [2015] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Neural Information Processing Systems (NIPS), 2015.
 Zhang et al. [2019] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. J. Johnson, and S. Levine. SOLAR: Deep structured latent representations for model-based reinforcement learning. In International Conference on Machine Learning (ICML), 2019.
 Zhu et al. [2019] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. In International Conference on Robotics and Automation (ICRA), 2019.
 Zhu et al. [2018] P. Zhu, X. Li, P. Poupart, and G. Miao. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1804.06309, 2018.
 Ziebart [2010] B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
Appendix A Network Architectures
Recall that our full sequential latent variable model has two layers of latent variables, which we denote ${\mathbf{z}}_{t}^{1}$ and ${\mathbf{z}}_{t}^{2}$. We found this design to provide a good balance between ease of training and expressivity, producing good reconstructions and generations and, crucially, providing good representations for reinforcement learning. For reference, we reproduce the model diagram from the main paper in Figure 10. Note that this diagram represents the Bayes net corresponding to our full model. However, since all of the latent variables are stochastic, this visualization also presents the design of the computation graph. Inference over the latent variables is performed using amortized variational inference, with all training done via reparameterization. Hence, the computation graph can be deduced from the diagram by treating all solid arrows as part of the generative model and all dashed arrows as part of the approximate posterior. The generative model consists of the following probability distributions, as described in the main paper:
${\mathbf{z}}_{1}^{1}\sim p({\mathbf{z}}_{1}^{1})$
${\mathbf{z}}_{1}^{2}\sim {p}_{\psi}({\mathbf{z}}_{1}^{2}\mid {\mathbf{z}}_{1}^{1})$
${\mathbf{z}}_{t+1}^{1}\sim {p}_{\psi}({\mathbf{z}}_{t+1}^{1}\mid {\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})$
${\mathbf{z}}_{t+1}^{2}\sim {p}_{\psi}({\mathbf{z}}_{t+1}^{2}\mid {\mathbf{z}}_{t+1}^{1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})$
${\mathbf{x}}_{t}\sim {p}_{\psi}({\mathbf{x}}_{t}\mid {\mathbf{z}}_{t}^{1},{\mathbf{z}}_{t}^{2})$
${r}_{t}\sim {p}_{\psi}({r}_{t}\mid {\mathbf{z}}_{t}^{1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t},{\mathbf{z}}_{t+1}^{1},{\mathbf{z}}_{t+1}^{2}).$
The initial distribution $p({\mathbf{z}}_{1}^{1})$ is a multivariate standard normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$. All of the other distributions are conditional and parametrized by neural networks with parameters $\psi $. The networks for ${p}_{\psi}({\mathbf{z}}_{1}^{2}\mid {\mathbf{z}}_{1}^{1})$, ${p}_{\psi}({\mathbf{z}}_{t+1}^{1}\mid {\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})$, ${p}_{\psi}({\mathbf{z}}_{t+1}^{2}\mid {\mathbf{z}}_{t+1}^{1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})$, and ${p}_{\psi}({r}_{t}\mid {\mathbf{z}}_{t}^{1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t},{\mathbf{z}}_{t+1}^{1},{\mathbf{z}}_{t+1}^{2})$ consist of two fully connected layers, each with 256 hidden units, and a Gaussian output layer. The Gaussian layer is defined such that it outputs a multivariate normal distribution with diagonal variance, where the mean is the output of a linear layer and the diagonal standard deviation is the output of a fully connected layer with softplus nonlinearity. The observation model ${p}_{\psi}({\mathbf{x}}_{t}\mid {\mathbf{z}}_{t}^{1},{\mathbf{z}}_{t}^{2})$ consists of 5 transposed convolutional layers (256 $4\times 4$, 128 $3\times 3$, 64 $3\times 3$, 32 $3\times 3$, and 3 $5\times 5$ filters, respectively, stride 2 each, except for the first layer). The output variance for each image pixel is fixed to $0.1$.
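The Gaussian output layer described above can be sketched in a few lines. This is a minimal pure-Python sketch, assuming weights are given as nested lists; the function names and the weight layout are illustrative assumptions and do not reflect the actual implementation.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); always positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gaussian_layer(x, w_mean, b_mean, w_std, b_std):
    """Map features x to the mean and diagonal standard deviation of a Gaussian.

    The mean is a linear function of x; the standard deviation is a linear
    function of x passed through softplus so that it stays positive.
    """
    mean = linear(x, w_mean, b_mean)
    std = [softplus(v) for v in linear(x, w_std, b_std)]
    return mean, std
```

A reparameterized sample would then be `mean[i] + std[i] * eps[i]` with `eps` drawn from a standard normal distribution.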
The variational distribution $q$, also referred to as the inference model or the posterior, is represented by the following factorization:
${\mathbf{z}}_{1}^{1}\sim {q}_{\psi}({\mathbf{z}}_{1}^{1}\mid {\mathbf{x}}_{1})$
${\mathbf{z}}_{1}^{2}\sim {p}_{\psi}({\mathbf{z}}_{1}^{2}\mid {\mathbf{z}}_{1}^{1})$
${\mathbf{z}}_{t+1}^{1}\sim {q}_{\psi}({\mathbf{z}}_{t+1}^{1}\mid {\mathbf{x}}_{t+1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})$
${\mathbf{z}}_{t+1}^{2}\sim {p}_{\psi}({\mathbf{z}}_{t+1}^{2}\mid {\mathbf{z}}_{t+1}^{1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t}).$
Note that the variational distribution over ${\mathbf{z}}_{1}^{2}$ and ${\mathbf{z}}_{t+1}^{2}$ is intentionally chosen to exactly match the generative model $p$, such that this term does not appear in the KL divergence within the ELBO, and a separate variational distribution is only learned over ${\mathbf{z}}_{1}^{1}$ and ${\mathbf{z}}_{t+1}^{1}$. This intentional design decision simplifies the inference process. The networks representing the distributions ${q}_{\psi}({\mathbf{z}}_{1}^{1}\mid {\mathbf{x}}_{1})$ and ${q}_{\psi}({\mathbf{z}}_{t+1}^{1}\mid {\mathbf{x}}_{t+1},{\mathbf{z}}_{t}^{2},{\mathbf{a}}_{t})$ both consist of 5 convolutional layers (32 $5\times 5$, 64 $3\times 3$, 128 $3\times 3$, 256 $3\times 3$, and 256 $4\times 4$ filters, respectively, stride 2 each, except for the last layer), 2 fully connected layers (256 units each), and a Gaussian output layer. The parameters of the convolution layers are shared among both distributions.
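As a sanity check on the layer specifications, the sketch below traces the spatial size of the feature maps through the convolutional encoder and through the transposed-convolutional decoder of the observation model. The padding scheme is not stated in the text: the sketch assumes "same" padding for the stride-2 layers and "valid" padding for the single stride-1 $4\times 4$ layer, with a stride of 1 for that layer. Under these assumptions a $64\times 64$ image is reduced to $1\times 1$ and expanded back to $64\times 64$.

```python
import math


def conv_out(size, kernel, stride, padding):
    """Output width of a convolution ("same" or "valid" padding)."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid"


def deconv_out(size, kernel, stride, padding):
    """Output width of a transposed convolution ("same" or "valid" padding)."""
    if padding == "same":
        return size * stride
    return (size - 1) * stride + kernel  # "valid"


# (kernel, stride, padding) per layer; strides and padding are assumptions.
ENCODER = [(5, 2, "same"), (3, 2, "same"), (3, 2, "same"), (3, 2, "same"), (4, 1, "valid")]
DECODER = [(4, 1, "valid"), (3, 2, "same"), (3, 2, "same"), (3, 2, "same"), (5, 2, "same")]


def trace(size, layers, step):
    """Apply `step` layer by layer and record every intermediate width."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = step(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


encoder_sizes = trace(64, ENCODER, conv_out)    # 64 -> 32 -> 16 -> 8 -> 4 -> 1
decoder_sizes = trace(1, DECODER, deconv_out)   # 1 -> 4 -> 8 -> 16 -> 32 -> 64
```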
The latent variables have 32 and 256 dimensions, respectively, i.e. ${\mathbf{z}}_{t}^{1}\in {\mathbb{R}}^{32}$ and ${\mathbf{z}}_{t}^{2}\in {\mathbb{R}}^{256}$. For the image observations, ${\mathbf{x}}_{t}\in {[0,1]}^{64\times 64\times 3}$. All the layers, except for the output layers, use leaky ReLU nonlinearities. Note that there are no deterministic recurrent connections in the network: all networks are feedforward, and the temporal dependencies all flow through the stochastic units ${\mathbf{z}}_{t}^{1}$ and ${\mathbf{z}}_{t}^{2}$.
For the reinforcement learning process, we use a critic network ${Q}_{\theta}$ consisting of 2 fully connected layers (256 units each) and a linear output layer. The actor network ${\pi}_{\varphi}$ consists of 5 convolutional layers, 2 fully connected layers (256 units each), a Gaussian layer, and a tanh bijector, which constrains the actions to be in the bounded action space of $[-1,1]$. The convolutional layers are the same as the ones from the latent variable model, but the parameters of these layers are not updated by the actor objective. The exact same network architecture is used for every one of the experiments in the paper.
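The effect of the tanh bijector on the actor's Gaussian output can be sketched as follows. The function name and the use of Python's `random.Random` as the noise source are assumptions made for illustration.

```python
import math
import random


def sample_squashed_action(mean, std, rng):
    """Sample from a diagonal Gaussian and squash each component with tanh.

    mean, std: per-dimension parameters produced by the Gaussian layer.
    rng: a random.Random instance (hypothetical choice of noise source).
    Every returned component lies in the bounded action space [-1, 1].
    """
    pre_tanh = [rng.gauss(m, s) for m, s in zip(mean, std)]
    return [math.tanh(u) for u in pre_tanh]


action = sample_squashed_action([0.0, 5.0], [1.0, 2.0], random.Random(0))
```

Because tanh is a smooth bijection onto the open interval $(-1,1)$, the density of the squashed action follows from the Gaussian density by the change-of-variables formula, as in SAC.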
Appendix B Training and Evaluation Details
The control portion of our algorithm uses the same hyperparameters as SAC (Haarnoja et al., 2018a), except for a smaller replay buffer size of 100000 environment steps (instead of a million) due to the high memory usage of image observations. All of the parameters are trained with the Adam optimizer (Kingma and Ba, 2015), and we perform one gradient step per environment step. The Q-function and policy parameters are trained with a learning rate of 0.0003 and a batch size of 256. The model parameters are trained with a learning rate of 0.0001 and a batch size of 32. We use sequences of length $\tau =8$ for all the tasks. Note that the sequence length can be less than $\tau $ for the first $t$ steps ($t<\tau $) of each episode.
We use action repeats for all the methods, except for D4PG, for which we use the reported results from prior work (Tassa et al., 2018). The number of environment steps reported in our plots corresponds to the unmodified steps of the benchmarks. Note that the methods that use action repeats only use a fraction of the environment steps reported in our plots. For example, 3 million environment steps of the cheetah task correspond to 750000 samples when using an action repeat of 4. The action repeats used in our experiments are given in Table 1.
Unlike in prior work (Haarnoja et al., 2018a, b), we use the same stochastic policy as both the behavioral and evaluation policy, since we found the deterministic greedy policy to be comparable to or worse than the stochastic policy.
Table 1: Action repeats used in our experiments.
| Benchmark | Task | Action repeat | Original control time step | Effective control time step |
|---|---|---|---|---|
| DeepMind Control Suite | cheetah run | 4 | 0.01 | 0.04 |
| DeepMind Control Suite | walker walk | 2 | 0.025 | 0.05 |
| DeepMind Control Suite | ball-in-cup catch | 4 | 0.02 | 0.08 |
| DeepMind Control Suite | finger spin | 2 | 0.02 | 0.04 |
| OpenAI Gym | HalfCheetah-v2 | 1 | 0.05 | 0.05 |
| OpenAI Gym | Walker2d-v2 | 4 | 0.008 | 0.032 |
| OpenAI Gym | Hopper-v2 | 2 | 0.008 | 0.016 |
| OpenAI Gym | Ant-v2 | 4 | 0.05 | 0.2 |
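The effective control time step in the table is the original control time step multiplied by the action repeat. The short sketch below recomputes that column from the other two; the variable and function names are illustrative.

```python
import math

# (task, action repeat, original control time step), as listed in the table.
ACTION_REPEATS = [
    ("cheetah run", 4, 0.01),
    ("walker walk", 2, 0.025),
    ("ball-in-cup catch", 4, 0.02),
    ("finger spin", 2, 0.02),
    ("HalfCheetah-v2", 1, 0.05),
    ("Walker2d-v2", 4, 0.008),
    ("Hopper-v2", 2, 0.008),
    ("Ant-v2", 4, 0.05),
]


def effective_time_step(action_repeat, original_time_step):
    """Time between consecutive decisions when each action is repeated."""
    return action_repeat * original_time_step


effective = {
    task: effective_time_step(repeat, dt)
    for task, repeat, dt in ACTION_REPEATS
}
```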
Appendix C Reward Functions for the Manipulation Tasks
Sawyer Door Open
$r=-|{\theta}_{\text{door}}-{\theta}_{\text{desired}}|$  (12)
Sawyer Drawer Close
${d}_{\text{handToHandle}}={\|xy{z}_{\text{hand}}-xy{z}_{\text{handle}}\|}_{2}^{2}$
${d}_{\text{drawerToGoal}}=|{x}_{\text{drawer}}-{x}_{\text{goal}}|$
$r=-{d}_{\text{handToHandle}}-{d}_{\text{drawerToGoal}}$  (13)
Sawyer Pickup
${d}_{\text{toObj}}={\|xy{z}_{\text{hand}}-xy{z}_{\text{obj}}\|}_{2}^{2}$
${r}_{\text{reach}}=0.25\,(1-\tanh (10.0\cdot {d}_{\text{toObj}}))$
${r}_{\text{lift}}=\begin{cases}1 & \text{if object lifted}\\ 0 & \text{else}\end{cases}$
$r=\max ({r}_{\text{reach}},{r}_{\text{lift}})$  (14)
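The three reward functions can be written out directly. The sketch below assumes that distance terms enter the reward with a negative sign, that the hand distances are squared Euclidean norms, and that the lift condition is supplied as a boolean; the function and argument names are illustrative and not taken from the task suite.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two xyz positions."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def door_open_reward(theta_door, theta_desired):
    """Negative absolute error between the door angle and its target."""
    return -abs(theta_door - theta_desired)


def drawer_close_reward(xyz_hand, xyz_handle, x_drawer, x_goal):
    """Penalize hand-to-handle distance and drawer-to-goal distance."""
    d_hand_to_handle = squared_distance(xyz_hand, xyz_handle)
    d_drawer_to_goal = abs(x_drawer - x_goal)
    return -d_hand_to_handle - d_drawer_to_goal


def pickup_reward(xyz_hand, xyz_obj, object_lifted):
    """Shaped reaching reward, replaced by 1 once the object is lifted."""
    d_to_obj = squared_distance(xyz_hand, xyz_obj)
    r_reach = 0.25 * (1.0 - math.tanh(10.0 * d_to_obj))
    r_lift = 1.0 if object_lifted else 0.0
    return max(r_reach, r_lift)
```

Under these assumptions the reaching term never exceeds 0.25, so a successful lift always dominates the reward.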
Appendix D Additional samples from our model
We show additional samples from our model in Figure 11, Figure 12, and Figure 13. Samples from the posterior show the images ${\mathbf{x}}_{t}$ as constructed by the decoder ${p}_{\psi}({\mathbf{x}}_{t}\mid {\mathbf{z}}_{t})$, using a sequence of latents ${\mathbf{z}}_{t}$ that are encoded and sampled from the posteriors, ${q}_{\psi}({\mathbf{z}}_{1}\mid {\mathbf{x}}_{1})$ and ${q}_{\psi}({\mathbf{z}}_{t+1}\mid {\mathbf{x}}_{t+1},{\mathbf{z}}_{t},{\mathbf{a}}_{t})$. Samples from the prior, on the other hand, use a sequence of latents where ${\mathbf{z}}_{1}$ is sampled from $p({\mathbf{z}}_{1})$ and all remaining latents ${\mathbf{z}}_{t}$ are from the propagation of the previous latent state through the latent dynamics ${p}_{\psi}({\mathbf{z}}_{t+1}\mid {\mathbf{z}}_{t},{\mathbf{a}}_{t})$. These samples do not use any image frames as inputs, and thus they do not correspond to any ground truth sequence. We also show samples from the conditional prior, which is conditioned on the first image from the true sequence: for this, the sampling procedure is the same as the prior, except that ${\mathbf{z}}_{1}$ is encoded and sampled from the posterior ${q}_{\psi}({\mathbf{z}}_{1}\mid {\mathbf{x}}_{1})$, rather than being sampled from $p({\mathbf{z}}_{1})$.
[Figures 11 to 13: additional samples from our model for Walker walk, Ball-in-cup catch, Finger spin, HalfCheetah-v2, Walker2d-v2, Hopper-v2, Ant-v2, Sawyer Door Open, Sawyer Drawer Close, and Sawyer Pickup. Each panel shows image rows for the ground truth, posterior samples, conditional prior samples, and prior samples.]