Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model

  • 2020-02-03 15:07:43
  • Alex X. Lee, Anusha Nagabandi, Pieter Abbeel, Sergey Levine


 


Alex X. Lee (1,2), Anusha Nagabandi (1), Pieter Abbeel (1), Sergey Levine (1)
(1) University of California, Berkeley
(2) DeepMind
{alexlee_gk,nagaban2,pabbeel,svlevine}@cs.berkeley.edu
Abstract

Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these kinds of observation spaces present a number of challenges in practice, since the policy must now solve two problems: a representation learning problem, and a task learning problem. In this paper, we aim to explicitly learn representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC learns a compact latent representation space using a stochastic sequential latent variable model, and then learns a critic model within this latent space. By learning a critic within a compact state space, SLAC can learn much more efficiently than standard RL methods. The proposed model improves performance substantially over alternative representations as well, such as variational autoencoders. In fact, our experimental evaluation demonstrates that the sample efficiency of our resulting method is comparable to that of model-based RL methods that directly use a similar type of model for control. Furthermore, our method outperforms both model-free and model-based alternatives in terms of final performance and sample efficiency, on a range of difficult image-based control tasks. Our code and videos of our results are available at our website (https://alexlee-gk.github.io/slac/).


1 Introduction

Deep reinforcement learning (RL) algorithms can automatically learn to solve certain tasks from raw, low-level observations such as images. However, these kinds of observation spaces present a number of challenges in practice: on one hand, it is difficult to directly learn from these high-dimensional inputs, but on the other hand, it is also difficult to tease out a compact representation of the underlying task-relevant information from which to learn instead. For these reasons, deep RL directly from low-level observations such as images remains a challenging problem. Particularly in continuous domains governed by complex dynamics, such as robotic control, standard approaches still require separate sensor setups to monitor details of interest in the environment, such as the joint positions of a robot or pose information of objects of interest. To instead be able to learn directly from the more general and rich modality of vision would greatly advance the current state of our learning systems, so we aim to study precisely this. Standard model-free deep RL aims to use direct end-to-end training to explicitly unify these tasks of representation learning and task learning. However, solving both problems together is difficult, since an effective policy requires an effective representation, but in order for an effective representation to emerge, the policy or value function must provide meaningful gradient information using only the model-free supervision signal (i.e., the reward function). In practice, learning directly from images with standard RL algorithms can be slow, sensitive to hyperparameters, and inefficient. In contrast to end-to-end learning with RL, predictive learning can benefit from a rich and informative supervision signal before the agent has even made progress on the task or received any rewards. 
This leads us to ask: can we explicitly learn a latent representation from raw low-level observations that makes deep RL easier, through learning a predictive latent variable model?

Predictive models are commonly used in model-based RL for the purpose of planning (deisenroth2011pilco; finn2017deep; nagabandi2018neural; chua2018deep; zhang2019solar) or generating cheap synthetic experience for RL to reduce the required amount of interaction with the real environment (sutton1991dyna; gu2016continuous). However, in this work, we are primarily concerned with their potential to alleviate the representation learning challenge in RL. We devise a stochastic predictive model by modeling the high-dimensional observations as the consequence of a latent process, with a Gaussian prior and latent dynamics, as illustrated in Figure 1. A model with an entirely stochastic latent state has the appealing interpretation of being able to properly represent uncertainty about any of the state variables, given its past observations. We demonstrate in our work that fully stochastic state space models can in fact be learned effectively: With a well-designed stochastic network, such models outperform fully deterministic models, and contrary to the observations in prior work (hafner2019learning; buesing2018learning), are actually comparable to partially stochastic models. Finally, we note that this explicit representation learning, even on low-reward data, allows an agent with such a model to make progress on representation learning even before it makes progress on task learning.

Equipped with this model, we can then perform RL in the learned latent space of the predictive model. We posit—and confirm experimentally—that our latent variable model provides a useful representation for RL. Our model represents a partially observed Markov decision process (POMDP), and solving such a POMDP exactly would be computationally intractable (astrom1965optimal; kaelbling1998planning; igl2018deep). We instead propose a simple approximation that trains a Markovian critic on the (stochastic) latent state and trains an actor on a history of observations and actions. The resulting stochastic latent actor-critic (SLAC) algorithm loses some of the benefits of full POMDP solvers, but it is easy and stable to train. It also produces good results, in practice, on a range of challenging problems, making it an appealing alternative to more complex POMDP solution methods.

The main contributions of our SLAC algorithm are useful representations learned from our stochastic sequential latent variable model, as well as effective RL in this learned latent space. We show experimentally that our approach substantially improves on both model-free and model-based RL algorithms on a range of image-based continuous control benchmark tasks, attaining better final performance and learning more quickly than algorithms based on (a) end-to-end deep RL from images, (b) learning in a latent space produced by various alternative latent variable models, such as a variational autoencoder (VAE) (kingma2013auto), and (c) model-based RL based on latent state-space models with partially stochastic variables (hafner2019learning).

2 Related Work

Representation learning in RL. End-to-end deep RL can in principle learn representations directly as part of the RL process (mnih2013dqn). However, prior work has observed that RL has a “representation learning bottleneck”: a considerable portion of the learning period must be spent acquiring good representations of the observation space (shelhamer2016loss). This motivates the use of a distinct representation learning procedure to acquire these representations before the agent has even learned to solve the task. The use of auxiliary supervision in RL to learn such representations has been explored in a number of prior works (lange2010deep; finn2016deep; jaderberg2016auxiliary; higgins2017darla; ha2018world; nair2018visual; oord2018cpc; gelada2019deepmdp; dadashi2019polytope). In contrast to this class of representation learning algorithms, we explicitly learn a latent variable model of the POMDP, in which the latent representation and latent-space dynamics are jointly learned. By modeling covariances between consecutive latent states, we make it feasible for our proposed algorithm to perform Bellman backups directly in the latent space of the learned model.

Partial observability in RL. Our work is also related to prior research on RL under partial observability. Prior work has studied exact and approximate solutions to POMDPs, but they require explicit models of the POMDP and are only practical for simpler domains (kaelbling1998planning). Recent work has proposed end-to-end RL methods that use recurrent neural networks to process histories of observations and (sometimes) actions, but without constructing a model of the POMDP (hausknecht2015drqn; foerster2016ddqrn; zhu2018adrqn). Other works, however, learn latent-space dynamical system models and then use them to solve the POMDP with model-based RL (watter2015embed; wahlstrom2015pixels; karl2017deep; zhang2019solar; hafner2019learning). Although some of these works learn latent variable models that are similar to ours, these model-based methods are often limited by compounding model errors and finite horizon optimization. In contrast to these works, our approach does not use the model for prediction and performs infinite horizon policy optimization. Our approach benefits from the good asymptotic performance of model-free RL, while at the same time leveraging the improved latent space representation for sample efficiency. Other works have also trained latent variable models and used their representations as the inputs to model-free RL algorithms. They use representations encoded from latent states sampled from the forward model (buesing2018learning), belief representations obtained from particle filtering (igl2018deep), or belief representations obtained directly from a learned belief-space forward model (gregor2019shaping). Our approach is closely related to these prior methods, in that we also use model-free RL with a latent state representation that is learned via prediction. However, instead of using belief representations, our method learns a critic directly on latent state samples.

Sequential latent variable models. Several previous works have explored various modeling choices to learn stochastic sequential models (krishnan2015dkf; archer2015blackbox; karl2016dvbf; fraccaro2016srnn; fraccaro2017disentangled; doerr2018prssm). In the context of using sequential models for RL, previous works have typically observed that partially stochastic state space models are more effective than fully stochastic ones (buesing2018learning; igl2018deep; hafner2019learning). In these models, the state of the underlying MDP is modeled with the deterministic state of a recurrent network (e.g., LSTM (hochreiter1997lstm) or GRU (cho2014gru)), and optionally with some stochastic random variables. As mentioned earlier, a model with a latent state that is entirely stochastic has the appealing interpretation of learning a representation that can properly represent uncertainty about any of the state variables, given past observations. We demonstrate in our work that fully stochastic state space models can in fact be learned effectively and, with a well-designed stochastic network, such models perform on par with partially stochastic models and outperform fully deterministic models.

3 Reinforcement Learning and Modeling

This work addresses the problem of learning maximum entropy policies from high-dimensional observations in POMDPs, by simultaneously learning a latent representation of the underlying MDP state using variational inference and learning the policy in a maximum entropy RL framework. In this section, we describe maximum entropy RL (ziebart2010modeling; haarnoja2018soft; levine2018reinforcement) in fully observable MDPs, as well as variational methods for training latent state space models for POMDPs.

3.1 Maximum Entropy RL in Fully Observable MDPs

In a Markov decision process (MDP), an agent at time $t$ takes an action $\mathbf{a}_t \in \mathcal{A}$ from state $\mathbf{s}_t \in \mathcal{S}$ and reaches the next state $\mathbf{s}_{t+1} \in \mathcal{S}$ according to some stochastic transition dynamics $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$. The initial state $\mathbf{s}_1$ comes from a distribution $p(\mathbf{s}_1)$, and the agent receives a reward $r_t$ on each of the transitions. Standard RL aims to learn the parameters $\phi$ of some policy $\pi_\phi(\mathbf{a}_t|\mathbf{s}_t)$ such that the expected sum of rewards is maximized under the induced trajectory distribution $\rho_\pi$. This objective can be modified to incorporate an entropy term, such that the policy also aims to maximize the expected entropy $\mathcal{H}(\pi_\phi(\cdot|\mathbf{s}_t))$ under the induced trajectory distribution $\rho_\pi$. This formulation has a close connection to variational inference (ziebart2010modeling; haarnoja2018soft; levine2018reinforcement), and we build on this in our work. The resulting maximum entropy objective is

$$\phi^* = \arg\max_{\phi} \sum_{t=1}^{T} \mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim\rho_\pi}\left[ r(\mathbf{s}_t,\mathbf{a}_t) + \alpha\,\mathcal{H}(\pi_\phi(\cdot|\mathbf{s}_t)) \right], \tag{1}$$

where r is the reward function, and α is a temperature parameter that controls the trade-off between optimizing for the reward and for the entropy (i.e., stochasticity) of the policy. Soft actor-critic (SAC) (haarnoja2018soft) uses this maximum entropy RL framework to derive soft policy iteration, which alternates between policy evaluation and policy improvement within the described maximum entropy framework. SAC then extends this soft policy iteration to handle continuous action spaces by using parameterized function approximators to represent both the Q-function Qθ (critic) and the policy πϕ (actor). The soft Q-function parameters θ are optimized to minimize the soft Bellman residual,

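$$J_Q(\theta) = \mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t,r_t,\mathbf{s}_{t+1})\sim\mathcal{D}}\left[ \tfrac{1}{2}\left( Q_\theta(\mathbf{s}_t,\mathbf{a}_t) - \left( r_t + \gamma V_{\bar\theta}(\mathbf{s}_{t+1}) \right) \right)^2 \right], \tag{2}$$
$$V_{\bar\theta}(\mathbf{s}_{t+1}) = \mathbb{E}_{\mathbf{a}_{t+1}\sim\pi_\phi}\left[ Q_{\bar\theta}(\mathbf{s}_{t+1},\mathbf{a}_{t+1}) - \alpha\log\pi_\phi(\mathbf{a}_{t+1}|\mathbf{s}_{t+1}) \right], \tag{3}$$

where $\mathcal{D}$ is the replay buffer, $\gamma$ is the discount factor, and $\bar\theta$ are delayed parameters. The policy parameters $\phi$ are optimized to update the policy towards the exponential of the soft Q-function,

$$J_\pi(\phi) = \mathbb{E}_{\mathbf{s}_t\sim\mathcal{D}}\left[ \mathbb{E}_{\mathbf{a}_t\sim\pi_\phi}\left[ \alpha\log(\pi_\phi(\mathbf{a}_t|\mathbf{s}_t)) - Q_\theta(\mathbf{s}_t,\mathbf{a}_t) \right] \right]. \tag{4}$$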

Results of this stochastic, entropy maximizing RL framework demonstrate improved robustness and stability. SAC also shows the sample efficiency benefits of an off-policy learning algorithm, in conjunction with the high performance benefits of a long-horizon planning algorithm. Precisely for these reasons, we choose to extend the SAC algorithm in this work to formulate our SLAC algorithm.
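As a rough numerical illustration of Equations (2) to (4), the sketch below evaluates the soft Bellman residual and the policy loss on a toy batch. It assumes single-sample estimates of the inner expectations, and the function names and all numbers are made up for illustration; it is not the SAC reference implementation.

```python
import numpy as np

def soft_value(q_next, log_pi_next, alpha):
    """Single-sample estimate of Eq. (3): Q(s', a') - alpha * log pi(a'|s')."""
    return q_next - alpha * log_pi_next

def critic_loss(q, reward, q_next_target, log_pi_next, alpha, gamma):
    """Soft Bellman residual of Eq. (2), averaged over a batch."""
    target = reward + gamma * soft_value(q_next_target, log_pi_next, alpha)
    return np.mean(0.5 * (q - target) ** 2)

def actor_loss(q, log_pi, alpha):
    """Policy loss of Eq. (4), averaged over a batch."""
    return np.mean(alpha * log_pi - q)

# Toy batch of two transitions (hypothetical values).
q = np.array([1.0, 2.0])               # Q_theta(s_t, a_t)
reward = np.array([0.5, 0.0])          # r_t
q_next = np.array([1.0, 3.0])          # Q_theta_bar(s_{t+1}, a_{t+1})
log_pi_next = np.array([-1.0, -2.0])   # log pi(a_{t+1} | s_{t+1})

loss_q = critic_loss(q, reward, q_next, log_pi_next, alpha=0.1, gamma=0.9)
loss_pi = actor_loss(q, np.array([-1.0, -0.5]), alpha=0.1)
```

In a full implementation the target would be treated as a constant when differentiating the critic loss, and the action in the policy loss would be sampled with the reparameterization trick so that gradients flow through the Q-function.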

3.2 Sequential Latent Variable Models and Amortized Variational Inference in POMDPs

To learn representations for RL, we use latent variable models trained with amortized variational inference. The learned model must be able to process a large number of pixels that are present in the entangled image $\mathbf{x}$, and it must tease out the relevant information into a compact and disentangled representation $\mathbf{z}$. To learn such a model, we can consider maximizing the probability of each observed datapoint $\mathbf{x}$ from some training set $\mathcal{D}$ under the entire generative process $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}$. This objective is intractable to compute in general due to the marginalization of the latent variables $\mathbf{z}$. In amortized variational inference, we utilize the following bound on the log-likelihood (kingma2013auto),

$$\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\log p(\mathbf{x})\right] \geq \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[ \mathbb{E}_{\mathbf{z}\sim q}\left[\log p(\mathbf{x}|\mathbf{z})\right] - D_{\mathrm{KL}}\left(q(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right) \right]. \tag{5}$$

We can maximize the probability of the observed datapoints (i.e., the left hand side of Equation (5)) by learning an encoder $q(\mathbf{z}|\mathbf{x})$ and a decoder $p(\mathbf{x}|\mathbf{z})$, and then directly performing gradient ascent on the right hand side of the equation. In this setup, the distributions of interest are the prior $p(\mathbf{z})$, the observation model $p(\mathbf{x}|\mathbf{z})$, and the posterior $q(\mathbf{z}|\mathbf{x})$.
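To make the right hand side of the bound concrete, the following minimal sketch evaluates a single-sample estimate of it for one datapoint. It assumes a standard normal prior, a diagonal Gaussian encoder, and a Gaussian decoder with unit variance; these choices, the function names, and the numbers are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var), axis=-1)

def gaussian_log_likelihood(x, x_mean, var=1.0):
    """log N(x; x_mean, var * I), summed over observation dimensions."""
    return -0.5 * np.sum((x - x_mean) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

def single_sample_elbo(x, x_recon_mean, mu, var):
    """Reconstruction log-likelihood minus the KL regularizer."""
    return gaussian_log_likelihood(x, x_recon_mean) - kl_to_standard_normal(mu, var)

# Hypothetical 2-D observation, perfectly reconstructed by the decoder.
x = np.array([0.2, 0.8])
x_recon = np.array([0.2, 0.8])
mu = np.array([1.0, 0.0])    # encoder mean
var = np.array([1.0, 1.0])   # encoder variance

kl = kl_to_standard_normal(mu, var)
bound = single_sample_elbo(x, x_recon, mu, var)
```

Here the encoder mean is one unit away from the prior mean in one dimension, so the KL term equals 0.5 and lowers the bound by that amount even though the reconstruction is exact.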

Although such generative models have been shown to successfully model various types of complex distributions (kingma2013auto) by embedding knowledge of the distribution into an informative latent space, they do not have a built-in mechanism for the use of temporal information when performing inference. In the case of partially observable environments, as we discuss below, the representative latent state 𝐳t corresponding to a given non-Markovian observation 𝐱t needs to be informed by past observations.

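Consider a partially observable MDP (POMDP), where an action $\mathbf{a}_t \in \mathcal{A}$ from latent state $\mathbf{z}_t \in \mathcal{Z}$ results in latent state $\mathbf{z}_{t+1} \in \mathcal{Z}$ and emits a corresponding observation $\mathbf{x}_{t+1} \in \mathcal{X}$. We make an explicit distinction between an observation $\mathbf{x}_t$ and the underlying latent state $\mathbf{z}_t$, to emphasize that the latter is unobserved and the distribution is not known a priori. Analogous to the fully observable MDP, the initial state distribution is $p(\mathbf{z}_1)$, the transition probability distribution is $p(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t)$, and the reward is $r_t$. In addition, the observation model is given by $p(\mathbf{x}_t|\mathbf{z}_t)$.

As in the case for VAEs, a generative model of these observations $\mathbf{x}_t$ can be learned by maximizing the log-likelihood. In the POMDP setting, however, we note that $\mathbf{x}_t$ alone does not provide all necessary information to infer $\mathbf{z}_t$, and thus, prior temporal information must be taken into account. This brings us to the discussion of sequential latent variable models. The distributions of interest are the priors $p(\mathbf{z}_1)$ and $p(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t)$, the observation model $p(\mathbf{x}_t|\mathbf{z}_t)$, and the approximate posteriors $q(\mathbf{z}_1|\mathbf{x}_1)$ and $q(\mathbf{z}_{t+1}|\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)$. The log-likelihood of the observations can then be bounded, similarly to the VAE bound in Equation (5), as

$$\log p(\mathbf{x}_{1:\tau+1}|\mathbf{a}_{1:\tau}) \geq \mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q}\left[ \sum_{t=1}^{\tau+1} \log p(\mathbf{x}_t|\mathbf{z}_t) - D_{\mathrm{KL}}\left(q(\mathbf{z}_1|\mathbf{x}_1)\,\|\,p(\mathbf{z}_1)\right) - \sum_{t=1}^{\tau} D_{\mathrm{KL}}\left(q(\mathbf{z}_{t+1}|\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\,\|\,p(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t)\right) \right]. \tag{6}$$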
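The sequential bound has the same structure as the VAE bound, except that each KL term now compares a learned posterior with a learned, action-conditioned prior. The sketch below shows how a single-sample estimate could be assembled once the per-step Gaussian parameters are available. The helper names and the numbers are hypothetical, and the per-step log-likelihoods are supplied directly rather than computed by a decoder.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))), summed over the last axis."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )

def sequence_elbo(log_likelihoods, kl_terms):
    """Sum of per-step log p(x_t | z_t) minus the sum of per-step KL terms."""
    return np.sum(log_likelihoods) - np.sum(kl_terms)

# Three time steps (tau + 1 = 3) with a 2-D latent state; values are made up.
log_lik = np.array([-3.0, -2.5, -2.0])                    # log p(x_t | z_t)
mu_q = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])     # posterior means
var_q = np.ones((3, 2))                                   # posterior variances
mu_p = np.zeros((3, 2))                                   # prior means
var_p = np.ones((3, 2))                                   # prior variances

kl = kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
bound = sequence_elbo(log_lik, kl)
```

The first row of the prior and posterior parameters plays the role of the initial-step term, and the remaining rows correspond to the transition terms.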

Prior work (hafner2019learning; buesing2018learning; doerr2018probabilistic) has explored modeling such non-Markovian observation sequences, using methods such as recurrent neural networks with deterministic hidden state, as well as probabilistic state-space models. In this work, we enable the effective training of a fully stochastic sequential latent variable model, and bring it together with a maximum entropy actor-critic RL algorithm to create SLAC: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs.

4 Joint Modeling and Control as Inference


Figure 1: Graphical model of POMDP with optimality variables for $t \geq \tau+1$.

Our method aims to learn maximum entropy policies from high-dimensional, non-Markovian observations in a POMDP, while also learning a model of that POMDP. The model alleviates the representation learning problem, which in turn helps with the policy learning problem. We formulate the control problem as inference in a probabilistic graphical model with latent variables, as shown in Figure 1.

For a fully observable MDP, the control problem can be embedded into a graphical model by introducing a binary random variable 𝒪t, which indicates if time step t is optimal. When its distribution is chosen to be p(𝒪t=1|𝐬t,𝐚t)=exp(r(𝐬t,𝐚t)), then maximization of p(𝒪1:T) via approximate inference in that model yields the optimal policy for the maximum entropy objective (levine2018reinforcement).

In a POMDP setting, the distribution can analogously be given by p(𝒪t=1|𝐳t,𝐚t)=exp(r(𝐳t,𝐚t)). Instead of maximizing the likelihood of the optimality variables alone, we jointly model the observations (including the observed rewards of the past time steps) and learn maximum entropy policies by maximizing the marginal likelihood p(𝐱1:τ+1,𝒪τ+1:T|𝐚1:τ). This objective represents both the likelihood of the observed data from the past τ steps, as well as the optimality of the agent’s actions for future steps. We factorize our variational distribution into a product of recognition terms q(𝐳1|𝐱1) and q(𝐳t+1|𝐱t+1,𝐳t,𝐚t), dynamics terms p(𝐳t+1|𝐳t,𝐚t), and policy terms π(𝐚t|𝐳t):

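$$q(\mathbf{z}_{1:T},\mathbf{a}_{\tau+1:T}|\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau}) = q(\mathbf{z}_1|\mathbf{x}_1) \prod_{t=1}^{\tau} q(\mathbf{z}_{t+1}|\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t) \prod_{t=\tau+1}^{T-1} p(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t) \prod_{t=\tau+1}^{T} \pi(\mathbf{a}_t|\mathbf{z}_t). \tag{7}$$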

The variational distribution uses the dynamics for future time steps to prevent the agent from controlling the transitions and from choosing optimistic actions (levine2018reinforcement). The posterior over the actions represents the agent’s policy π. Although this derivation uses a policy that is conditioned on the latent state, our algorithm, which will be described in the next section, learns a parametric policy that is directly conditioned on observations and actions. This approximation allows us to directly execute the policy without having to perform inference on the latent state at run time.

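We use the posterior from Equation (7) to obtain the evidence lower bound (ELBO) of the marginal likelihood,

$$\log p(\mathbf{x}_{1:\tau+1},\mathcal{O}_{\tau+1:T}|\mathbf{a}_{1:\tau}) \geq \mathbb{E}_{(\mathbf{z}_{1:T},\mathbf{a}_{\tau+1:T})\sim q}\left[ \sum_{t=1}^{\tau+1} \log p(\mathbf{x}_t|\mathbf{z}_t) - D_{\mathrm{KL}}\left(q(\mathbf{z}_1|\mathbf{x}_1)\,\|\,p(\mathbf{z}_1)\right) - \sum_{t=1}^{\tau} D_{\mathrm{KL}}\left(q(\mathbf{z}_{t+1}|\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\,\|\,p(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t)\right) + \sum_{t=\tau+1}^{T} \left( r(\mathbf{z}_t,\mathbf{a}_t) + \log p(\mathbf{a}_t) - \log\pi(\mathbf{a}_t|\mathbf{z}_t) \right) \right], \tag{8}$$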

where p(𝐚t) is the action prior. The full derivation of the ELBO is given in Appendix A. This derivation assumes that the reward function, which determines p(𝒪t|𝐳t,𝐚t), is known. However, in many RL problems, this is not the case. In that situation, we can simply append the reward to the observation, and learn the reward along with p(𝐱t|𝐳t). This requires no modification to our method other than changing the observation space, and we use this approach in all of our experiments. We do this to learn latent representations that are more relevant to the task, but we do not use predictions from it. Instead, the RL objective uses rewards from the agent’s experience, as in model-free RL.

5 Stochastic Latent Actor Critic

We now describe our stochastic latent actor critic (SLAC) algorithm, which approximately maximizes the ELBO using function approximators to model the prior and posterior distributions. The ELBO objective in Equation (8) can be split into a model objective and a maximum entropy RL objective. The model objective can directly be optimized, while the maximum entropy RL objective can be solved via message passing. We can learn Q-functions for the messages, and then we can rewrite the RL objective to express it in terms of these messages. Additional details of the derivation of the SLAC objectives are given in Appendix A.

Latent Variable Model: The first part of the ELBO corresponds to training the latent variable model to maximize the likelihood of the observations, analogous to the ELBO in Equation (6) for the sequential latent variable model. The distributions of the latent variable model are diagonal Gaussian distributions, where the means and variances are outputs of neural networks. The distribution parameters $\psi$ of this model are optimized to maximize the first part of the ELBO. The model loss is

$$J_M(\psi) = \mathbb{E}_{(\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})\sim\mathcal{D}}\left[ \mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_\psi}\left[ \sum_{t=1}^{\tau+1} \log p_\psi(\mathbf{x}_t|\mathbf{z}_t) - D_{\mathrm{KL}}\left(q_\psi(\mathbf{z}_1|\mathbf{x}_1)\,\|\,p_\psi(\mathbf{z}_1)\right) - \sum_{t=1}^{\tau} D_{\mathrm{KL}}\left(q_\psi(\mathbf{z}_{t+1}|\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t)\,\|\,p_\psi(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t)\right) \right] \right]. \tag{9}$$

We use the reparameterization trick to sample from the filtering distribution $q_\psi(\mathbf{z}_{1:\tau+1}|\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})$.
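A minimal sketch of this sampling step is given below, assuming diagonal Gaussian posteriors. The function `toy_posterior` is a hypothetical stand-in for the learned inference networks, and the dimensions are chosen only so that the example runs. Each latent is drawn as mean plus standard deviation times unit Gaussian noise, and is then fed into the posterior of the next step.

```python
import numpy as np

def reparameterized_sample(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(np.shape(mean))
    return mean + std * eps

def sample_filtering_sequence(posterior_fn, observations, actions, rng):
    """Ancestral sampling of z_1, ..., z_{tau+1} from the filtering posterior.

    posterior_fn(x, z_prev, a_prev) returns (mean, std); z_prev and a_prev
    are None at the first step.
    """
    latents = []
    for i, x in enumerate(observations):
        if i == 0:
            mean, std = posterior_fn(x, None, None)
        else:
            mean, std = posterior_fn(x, latents[-1], actions[i - 1])
        latents.append(reparameterized_sample(mean, std, rng))
    return np.stack(latents)

def toy_posterior(x, z_prev, a_prev):
    """Hypothetical posterior: a linear mean and a fixed standard deviation."""
    mean = 0.5 * x
    if z_prev is not None:
        mean = mean + 0.4 * z_prev + 0.1 * a_prev
    return mean, np.full_like(mean, 0.1)

observations = np.ones((3, 2))   # x_1, x_2, x_3 (tau + 1 = 3)
actions = np.zeros((2, 2))       # a_1, a_2
latents = sample_filtering_sequence(
    toy_posterior, observations, actions, np.random.default_rng(0)
)
latents_again = sample_filtering_sequence(
    toy_posterior, observations, actions, np.random.default_rng(0)
)
```

Because the noise is separated from the distribution parameters, an automatic differentiation framework could propagate gradients of the model loss through the sampled latents into the mean and standard deviation.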

Critic and Actor: The second part of the ELBO corresponds to the maximum entropy RL objective. As in the fully observable case from Section 3.1 and as described by levine2018reinforcement, this optimization can be solved via message passing of soft Q-values, except that we use the latent states 𝐳 rather than the true states 𝐬. For continuous state and action spaces, this message passing is approximated by minimizing the soft Bellman residual, which we use to train our soft Q-function parameters θ,

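$$J_Q(\theta) = \mathbb{E}_{(\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau},r_\tau)\sim\mathcal{D}}\left[ \mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_\psi}\left[ \tfrac{1}{2}\left( Q_\theta(\mathbf{z}_\tau,\mathbf{a}_\tau) - \left( r_\tau + \gamma\,\mathbb{E}_{\mathbf{a}_{\tau+1}\sim\pi_\phi}\left[ Q_{\bar\theta}(\mathbf{z}_{\tau+1},\mathbf{a}_{\tau+1}) - \alpha\log\pi_\phi(\mathbf{a}_{\tau+1}|\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau}) \right] \right) \right)^2 \right] \right], \tag{10}$$

where $\bar\theta$ are delayed parameters, obtained as exponential moving averages of $\theta$. Notice that the latents $\mathbf{z}_\tau$ and $\mathbf{z}_{\tau+1}$, which are used in the Bellman backup, are sampled from the same joint, i.e. $\mathbf{z}_{\tau+1} \sim q_\psi(\mathbf{z}_{\tau+1}|\mathbf{x}_{\tau+1},\mathbf{z}_\tau,\mathbf{a}_\tau)$. The RL objective, which corresponds to the second part of the ELBO, can be rewritten in terms of the soft Q-function. The policy parameters $\phi$ are optimized to maximize this objective, analogously to soft actor-critic (haarnoja2018soft). The policy loss is then

$$J_\pi(\phi) = \mathbb{E}_{(\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})\sim\mathcal{D}}\left[ \mathbb{E}_{\mathbf{z}_{1:\tau+1}\sim q_\psi}\left[ \mathbb{E}_{\mathbf{a}_{\tau+1}\sim\pi_\phi}\left[ \alpha\log\pi_\phi(\mathbf{a}_{\tau+1}|\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau}) - Q_\theta(\mathbf{z}_{\tau+1},\mathbf{a}_{\tau+1}) \right] \right] \right]. \tag{11}$$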

We assume a uniform action prior, so $p(\mathbf{a}_t)$ is a constant term that we omit from the policy loss. We use the reparameterization trick to sample from the policy, and the policy loss only uses the last sample $\mathbf{z}_{\tau+1}$ of the sequence for the critic. Although the policy used in our derivation is conditioned on the latent state, our learned parametric policy is conditioned directly on the past observations and actions, so that the learned policy can be executed at run time without requiring inference of the latent state. Finally, we note that for the expectation over latent states in the Bellman residual in Equation (10), rather than sampling latent states $\mathbf{z} \sim \mathcal{Z}$, we sample latent states from the filtering distribution $q_\psi(\mathbf{z}_{1:\tau+1}|\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})$. This design choice allows us to minimize the critic loss for samples that are most relevant for $Q$, while also allowing the critic loss to use the Q-function in the same way as implied by the policy loss in Equation (11).

SLAC is outlined in Algorithm 1. The actor-critic component follows prior work, with automatic tuning of the temperature $\alpha$ and two Q-functions to mitigate overestimation (fujimoto2018addressing; haarnoja2018soft; haarnoja2018applications). SLAC can be viewed as a variant of SAC (haarnoja2018soft) where the critic is trained on the stochastic latent state of our sequential latent variable model. The backup for the critic is performed on a tuple $(\mathbf{z}_\tau,\mathbf{a}_\tau,r_\tau,\mathbf{z}_{\tau+1})$, sampled from the posterior $q(\mathbf{z}_{\tau+1},\mathbf{z}_\tau|\mathbf{x}_{1:\tau+1},\mathbf{a}_{1:\tau})$. The critic can, in principle, take advantage of the perfect knowledge of the state $\mathbf{z}_t$, which makes learning easier. However, the parametric policy does not have access to $\mathbf{z}_t$, and must make decisions based on a history of observations and actions. SLAC is not a model-based algorithm, in that it does not use the model for prediction, but we see in our experiments that SLAC can achieve sample efficiency similar to that of a model-based algorithm.

Algorithm 1: Stochastic Latent Actor-Critic (SLAC)

Require: $E$, $\psi$, $\theta_1$, $\theta_2$, $\phi$    (Environment and initial parameters for model, actor, and critic)
$\mathbf{x}_1 \sim E_{\mathrm{reset}}()$    (Sample initial observation from the environment)
$\mathcal{D} \leftarrow (\mathbf{x}_1)$    (Initialize replay buffer with initial observation)
for each iteration do
    for each environment step do
        $\mathbf{a}_t \sim \pi_\phi(\mathbf{a}_t|\mathbf{x}_{1:t},\mathbf{a}_{1:t-1})$    (Sample action from the policy)
        $r_t, \mathbf{x}_{t+1} \sim E_{\mathrm{step}}(\mathbf{a}_t)$    (Sample transition from the environment)
        $\mathcal{D} \leftarrow \mathcal{D} \cup (\mathbf{a}_t, r_t, \mathbf{x}_{t+1})$    (Store the transition in the replay buffer)
    end for
    for each gradient step do
        $\psi \leftarrow \psi - \lambda_M \nabla_\psi J_M(\psi)$    (Update model weights)
        $\theta_i \leftarrow \theta_i - \lambda_Q \nabla_{\theta_i} J_Q(\theta_i)$ for $i \in \{1,2\}$    (Update the Q-function weights)
        $\phi \leftarrow \phi - \lambda_\pi \nabla_\phi J_\pi(\phi)$    (Update policy weights)
        $\bar\theta_i \leftarrow \nu\theta_i + (1-\nu)\bar\theta_i$ for $i \in \{1,2\}$    (Update target critic network weights)
    end for
end for
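The last update in the gradient-step loop is an exponential moving average of the critic weights. A minimal sketch of that single step is shown below, assuming parameters stored as a dictionary of NumPy arrays; the dictionary layout and the value of the mixing coefficient are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def update_target(target_params, online_params, nu):
    """Return nu * online + (1 - nu) * target for every parameter array."""
    return {
        name: nu * online_params[name] + (1.0 - nu) * target_params[name]
        for name in target_params
    }

# Hypothetical critic with a single weight vector.
target = {"w": np.zeros(2)}
online = {"w": np.ones(2)}

# Two consecutive target updates with a made-up coefficient nu = 0.005.
target = update_target(target, online, nu=0.005)
target = update_target(target, online, nu=0.005)
```

With a small coefficient the target weights trail the online weights slowly, which keeps the regression target in the critic loss from shifting abruptly between gradient steps.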

6 Latent Variable Model

We briefly summarize our full model architecture here, with full details in Appendix B. Motivated by the recent success of autoregressive latent variables in VAEs (razavi2019vqvae2; maaloe2019biva), we factorize the latent variable $\mathbf{z}_t$ into two stochastic layers, $\mathbf{z}_t^1$ and $\mathbf{z}_t^2$, as shown in Figure 2. This factorization results in latent distributions that are more expressive, and it allows for some parts of the prior and posterior distributions to be shared. We found this design to produce high quality reconstructions and samples, and utilize it in all of our experiments. The generative model $p$ and the inference model $q$ are given by

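$$p_\psi(\mathbf{z}_1) = p_\psi(\mathbf{z}_1^2|\mathbf{z}_1^1)\,p(\mathbf{z}_1^1),$$
$$p_\psi(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t) = p_\psi(\mathbf{z}_{t+1}^2|\mathbf{z}_{t+1}^1,\mathbf{z}_t^2,\mathbf{a}_t)\,p_\psi(\mathbf{z}_{t+1}^1|\mathbf{z}_t^2,\mathbf{a}_t),$$
$$q_\psi(\mathbf{z}_1|\mathbf{x}_1) = p_\psi(\mathbf{z}_1^2|\mathbf{z}_1^1)\,q_\psi(\mathbf{z}_1^1|\mathbf{x}_1),$$
$$q_\psi(\mathbf{z}_{t+1}|\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t) = p_\psi(\mathbf{z}_{t+1}^2|\mathbf{z}_{t+1}^1,\mathbf{z}_t^2,\mathbf{a}_t)\,q_\psi(\mathbf{z}_{t+1}^1|\mathbf{x}_{t+1},\mathbf{z}_t^2,\mathbf{a}_t).$$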

Figure 2: Diagram of our full model. Solid arrows show the generative model, dashed arrows show the inference model. Rewards are not shown for clarity.

Note that we choose the variational distribution $q$ over $\mathbf{z}_t^2$ to be the same as the model $p$. Thus, the KL divergence in $J_M$ simplifies to the divergence between $q$ and $p$ over $\mathbf{z}_t^1$. We use a multivariate standard normal distribution for $p(\mathbf{z}_1^1)$, since it is not conditioned on any variables, i.e. $\mathbf{z}_1^1 \sim \mathcal{N}(\mathbf{0},\mathbf{I})$. The conditional distributions of our model are diagonal Gaussian, with means and variances given by neural networks. Unlike models from prior work (hafner2019learning; buesing2018learning; doerr2018probabilistic), which have deterministic and stochastic paths and use recurrent neural networks, ours is fully stochastic, i.e. our latent state is a Markovian latent random variable formed by the concatenation of $\mathbf{z}_t^1$ and $\mathbf{z}_t^2$. Further details are discussed in Appendix B.
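The simplification of the KL term follows directly from the shared second-layer conditional. As a short worked derivation for the transition step (the first time step is analogous), with the conditioning variables abbreviated by dots inside the expectations purely for readability:

```latex
\begin{align*}
& D_{\mathrm{KL}}\!\left( q_\psi(\mathbf{z}_{t+1}|\mathbf{x}_{t+1},\mathbf{z}_t,\mathbf{a}_t) \,\|\, p_\psi(\mathbf{z}_{t+1}|\mathbf{z}_t,\mathbf{a}_t) \right) \\
&\quad= \mathbb{E}_{q_\psi(\mathbf{z}_{t+1}|\cdot)}\!\left[ \log
   \frac{ p_\psi(\mathbf{z}_{t+1}^2|\mathbf{z}_{t+1}^1,\mathbf{z}_t^2,\mathbf{a}_t)\,
          q_\psi(\mathbf{z}_{t+1}^1|\mathbf{x}_{t+1},\mathbf{z}_t^2,\mathbf{a}_t) }
        { p_\psi(\mathbf{z}_{t+1}^2|\mathbf{z}_{t+1}^1,\mathbf{z}_t^2,\mathbf{a}_t)\,
          p_\psi(\mathbf{z}_{t+1}^1|\mathbf{z}_t^2,\mathbf{a}_t) } \right] \\
&\quad= \mathbb{E}_{q_\psi(\mathbf{z}_{t+1}^1|\cdot)}\!\left[ \log
   \frac{ q_\psi(\mathbf{z}_{t+1}^1|\mathbf{x}_{t+1},\mathbf{z}_t^2,\mathbf{a}_t) }
        { p_\psi(\mathbf{z}_{t+1}^1|\mathbf{z}_t^2,\mathbf{a}_t) } \right] \\
&\quad= D_{\mathrm{KL}}\!\left( q_\psi(\mathbf{z}_{t+1}^1|\mathbf{x}_{t+1},\mathbf{z}_t^2,\mathbf{a}_t) \,\|\, p_\psi(\mathbf{z}_{t+1}^1|\mathbf{z}_t^2,\mathbf{a}_t) \right).
\end{align*}
```

The second-layer factors cancel inside the logarithm, and the remaining integrand no longer depends on $\mathbf{z}_{t+1}^2$, so the expectation over it integrates to one.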

7 Experimental Evaluation

We evaluate SLAC on numerous image-based continuous control tasks from both the DeepMind Control Suite (tassa2018deepmind) and OpenAI Gym (brockman2016openai), as illustrated in Figure 3. Full details of SLAC’s network architecture are described in Appendix B. Aside from the value of action repeats (i.e. control frequency) for the tasks, we kept all of SLAC’s hyperparameters constant across all tasks in all domains. Training and evaluation details are given in Appendix C, and image samples from our model for all tasks are shown in Appendix E. Additionally, visualizations of our results and code are available on the project website (https://alexlee-gk.github.io/slac/).

Figure 3: Example image observations for our continuous control benchmark tasks: DeepMind Control’s cheetah run, walker walk, ball-in-cup catch, and finger spin, and OpenAI Gym’s half cheetah, walker, hopper, and ant (left to right). Images are rendered at a resolution of 64×64 pixels.

7.1 Comparative Evaluation on Continuous Control Benchmark Tasks

To provide a comparative evaluation against prior methods, we evaluate SLAC on four tasks (cheetah run, walker walk, ball-in-cup catch, finger spin) from the DeepMind Control Suite (tassa2018deepmind), and four tasks (cheetah, walker, ant, hopper) from OpenAI Gym (brockman2016openai). Note that the Gym tasks are typically used with low-dimensional state observations, while we evaluate on them with raw image observations. We compare our method to the following state-of-the-art model-based and model-free algorithms:

SAC (haarnoja2018soft): This is an off-policy actor-critic algorithm, which represents a comparison to state-of-the-art model-free learning. We include experiments showing the performance of SAC based on true state (as an upper bound on performance) as well as directly from raw images.

D4PG (barth2018distributed): This is also an off-policy actor-critic algorithm, learning directly from raw images. The results reported in the plots below are the performance after $10^8$ training steps, as stated in the benchmarks from (tassa2018deepmind).

MPO (abdolmaleki2018mpo; abdolmaleki2018mpo_b): This is an off-policy actor-critic algorithm that performs an expectation maximization form of policy iteration, learning directly from raw images.

PlaNet (hafner2019learning): This is a model-based RL method for learning from images, which uses a partially stochastic sequential latent variable model, but without explicit policy learning. Instead, the model is used for planning with model predictive control (MPC), where each plan is optimized with the cross entropy method (CEM).

DVRL (igl2018deep): This is an on-policy model-free RL algorithm that also trains a partially stochastic latent-variable POMDP model. DVRL uses the full belief over the latent state as input into both the actor and critic, as opposed to our method, which trains the critic with the latent state and the actor with a history of actions and observations.

Figure 4: Experiments on the DeepMind Control Suite from images (unless otherwise labeled as "state"). SLAC (ours) converges to similar or better final performance than the other methods, while almost always achieving reward as high as the upper bound SAC baseline that learns from true state. Note that for these experiments, 1000 environment steps correspond to 1 episode.

Figure 5: Experiments on the OpenAI Gym benchmark tasks from images. SLAC (ours) converges to higher performance than both PlaNet and SAC on all four of these tasks. The number of environment steps in each episode is variable, depending on the termination.

Our experiments on the DeepMind Control Suite in Figure 4 show that the sample efficiency of SLAC is comparable to or better than that of both model-based and model-free alternatives. This indicates that overcoming the representation learning bottleneck, coupled with efficient off-policy RL, provides for fast learning similar to model-based methods, while attaining final performance comparable to fully model-free techniques that learn from state. SLAC also substantially outperforms DVRL. This difference can be explained in part by the use of an efficient off-policy RL algorithm, which can better take advantage of the learned representation.

We also evaluate SLAC on continuous control benchmark tasks from OpenAI Gym in Figure 5. We notice that these tasks are much more challenging than the DeepMind Control Suite tasks, because the rewards are not as shaped and not bounded between 0 and 1, the dynamics are different, and the episodes terminate on failure (e.g., when the hopper or walker falls over). PlaNet is unable to solve the last three tasks, while for the cheetah task, it learns a suboptimal policy that involves flipping the cheetah over and pushing forward while on its back. To better understand the performance of fixed-horizon MPC on these tasks, we also evaluated with the ground truth dynamics (i.e., the true simulator), and found that even in this case, MPC did not achieve good final performance, suggesting that infinite horizon policy optimization, of the sort performed by SLAC and model-free algorithms, is important to attain good results on these tasks.

Our experiments show that SLAC successfully learns complex continuous control benchmark tasks from raw image inputs. On the DeepMind Control Suite, SLAC exceeds the performance of PlaNet on three of the tasks, and matches its performance on the walker task. However, on the harder image-based OpenAI Gym tasks, SLAC outperforms PlaNet by a large margin. In both domains, SLAC substantially outperforms all prior model-free methods. We note that the prior methods that we tested generally performed poorly on the image-based OpenAI Gym tasks, despite considerable hyperparameter tuning.

7.2 Evaluating the Latent Variable Model

Figure 6: Comparison of different design choices for the latent variable model.

We next study the tradeoffs between different design choices for the latent variable model. We compare our fully stochastic model, as described in Section 6, to a standard non-sequential VAE model (kingma2013auto), which has been used in multiple prior works for representation learning in RL (higgins2017darla; ha2018world; nair2018visual), the partially stochastic model used by PlaNet (hafner2019learning), as well as three variants of our model: a simple filtering model that does not factorize the latent variable into two layers of stochastic units, a fully deterministic model that removes all stochasticity from the hidden state dynamics, and a partially stochastic model that has stochastic transitions for $\mathbf{z}_t^1$ and deterministic transitions for $\mathbf{z}_t^2$, similar to the PlaNet model, but with our architecture. Both the fully deterministic and partially stochastic models use the same architecture as our fully stochastic model, including the same two-level factorization of the latent variable. In all cases, we use the RL framework of SLAC and only vary the choice of model for representation learning. As shown in the comparison in Figure 6, our fully stochastic model outperforms prior models as well as the deterministic and simple variants of our own model. The partially stochastic variant of our model matches the performance of our fully stochastic model. In other words, contrary to the conclusions in prior work (hafner2019learning; buesing2018learning), the fully stochastic model performs on par with a model that has deterministic paths, while retaining the appealing interpretation of a stochastic state space model. We hypothesize that these prior works benefit from the deterministic paths (realized as an LSTM or GRU) because they use multi-step samples from the prior. In contrast, our method uses samples from the posterior, which are conditioned on same-step observations, and thus these latent samples are less sensitive to the propagation of the latent states through time.

7.3 Qualitative Predictions from the Latent Variable Model

We show example image samples from our learned sequential latent variable model for the cheetah task in Figure 7, and we include the other tasks in Appendix E. Samples from the posterior show the images $\mathbf{x}_t$ as constructed by the decoder $p_\psi(\mathbf{x}_t \mid \mathbf{z}_t)$, using a sequence of latents $\mathbf{z}_t$ that are encoded and sampled from the posteriors, $q_\psi(\mathbf{z}_1 \mid \mathbf{x}_1)$ and $q_\psi(\mathbf{z}_{t+1} \mid \mathbf{x}_{t+1}, \mathbf{z}_t, \mathbf{a}_t)$. Samples from the prior, on the other hand, use a sequence of latents where $\mathbf{z}_1$ is sampled from $p(\mathbf{z}_1)$ and all remaining latents $\mathbf{z}_t$ are from the propagation of the previous latent state through the latent dynamics $p_\psi(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t)$. Note that these prior samples do not use any image frames as inputs, and thus they do not correspond to any ground truth sequence. We also show samples from the conditional prior, which is conditioned on the first image from the true sequence: for this, the sampling procedure is the same as the prior, except that $\mathbf{z}_1$ is encoded and sampled from the posterior $q_\psi(\mathbf{z}_1 \mid \mathbf{x}_1)$, rather than being sampled from $p(\mathbf{z}_1)$. We notice that the generated image samples can be made sharper and more realistic by using a smaller variance for $p_\psi(\mathbf{x}_t \mid \mathbf{z}_t)$ when training the model, but at the expense of a representation that leads to lower returns. Finally, note that we do not actually use the samples from the prior for training.

[Image panel: Cheetah run. Rows, top to bottom: truth, posterior sample, conditional prior sample, prior sample.]
Figure 7: Example image sequence seen for the cheetah task (first row), corresponding posterior sample (reconstruction) from our model (second row), and generated prediction from the generative model (last two rows). The second to last row is conditioned on the first frame (i.e., the posterior model is used for the first time step while the prior model is used for all subsequent steps), whereas the last row is not conditioned on any ground truth images. Note that all of these sampled sequences are conditioned on the same action sequence, and that our model produces highly realistic samples, even when predicting via the generative model.
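The three sampling procedures differ only in which distribution produces each latent. The sketch below makes that explicit with one-dimensional toy distributions; the four distribution functions are hypothetical stand-ins for the learned networks (they are not the paper's models), and the decoder step is omitted.

```python
import random


def prior_first():
    """Stand-in for p(z_1)."""
    return random.gauss(0.0, 1.0)


def prior_step(z, a):
    """Stand-in for the latent dynamics p(z_{t+1} | z_t, a_t)."""
    return random.gauss(0.9 * z + a, 0.1)


def posterior_first(x):
    """Stand-in for q(z_1 | x_1)."""
    return random.gauss(x, 0.05)


def posterior_step(x_next, z, a):
    """Stand-in for q(z_{t+1} | x_{t+1}, z_t, a_t)."""
    return random.gauss(0.5 * (0.9 * z + a) + 0.5 * x_next, 0.05)


def rollout(images, actions, mode):
    """Latent sequence for mode 'posterior', 'conditional_prior', or 'prior'."""
    # Only the pure prior ignores the first image.
    z = prior_first() if mode == "prior" else posterior_first(images[0])
    latents = [z]
    for t, a in enumerate(actions):
        if mode == "posterior":
            z = posterior_step(images[t + 1], z, a)  # sees every frame
        else:
            z = prior_step(z, a)  # no image input after the first step
        latents.append(z)
    return latents


random.seed(0)
images = [0.0, 0.1, 0.2, 0.3]  # scalar stand-ins for x_1..x_4
actions = [0.05, 0.05, 0.05]   # a_1..a_3, shared by all three rollouts
samples = {m: rollout(images, actions, m)
           for m in ("posterior", "conditional_prior", "prior")}
```

All three rollouts share the same action sequence, mirroring the note in the Figure 7 caption.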

8 Discussion

We presented SLAC, an efficient RL algorithm for learning from high-dimensional image inputs that combines efficient off-policy model-free RL with representation learning via a sequential stochastic state space model. Through representation learning in conjunction with effective task learning in the learned latent space, our method achieves improved sample efficiency and final task performance as compared to both prior model-based and model-free RL methods.

While our current SLAC algorithm is fully model-free, in that predictions from the model are not utilized to speed up training, a natural extension of our approach would be to use the model predictions themselves to generate synthetic samples. Incorporating this additional synthetic model-based data into a mixed model-based/model-free method could further improve sample efficiency and performance. More broadly, the use of explicit representation learning with RL has the potential to not only accelerate training time and increase the complexity of achievable tasks, but also enable reuse and transfer of our learned representation across tasks.

Acknowledgments

We thank Marvin Zhang, Abhishek Gupta, and Chelsea Finn for useful discussions and feedback, Danijar Hafner for providing timely assistance with PlaNet, and Maximilian Igl for providing timely assistance with DVRL. We also thank Deirdre Quillen, Tianhe Yu, and Chelsea Finn for providing us with their suite of Sawyer manipulation tasks. This research was supported by the National Science Foundation through IIS-1651843 and IIS-1700697, as well as ARL DCIST CRA W911NF-17-2-0181 and the Office of Naval Research. Compute support was provided by NVIDIA.

References

Appendix A Derivation of the Evidence Lower Bound and SLAC Objectives

In this appendix, we discuss how the SLAC objectives can be derived from applying a variational inference scheme to the control as inference framework for reinforcement learning (levine2018reinforcement). In this framework, the problem of finding the optimal policy is cast as an inference problem, conditioned on the evidence that the agent is behaving optimally. While levine2018reinforcement derives this in the fully observed case, we present a derivation in the POMDP setting.

We aim to maximize the marginal likelihood $p(\mathbf{x}_{1:\tau+1}, \mathcal{O}_{\tau+1:T} \mid \mathbf{a}_{1:\tau})$, where $\tau$ is the number of steps that the agent has already taken. This likelihood reflects that the agent cannot modify the past $\tau$ actions, which might not have been optimal, but it can choose the future actions up to the end of the episode, such that the chosen future actions are optimal. Notice that unlike the standard control as inference framework, in this work we not only maximize the likelihood of the optimality variables but also the likelihood of the observations, which provides additional supervision for the latent representation. This does not come up in the MDP setting since the state representation is fixed and learning a dynamics model of the state would not change the model-free equations derived from the maximum entropy RL objective.

For reference, we restate the factorization of our variational distribution:

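$$q(\mathbf{z}_{1:T}, \mathbf{a}_{\tau+1:T} \mid \mathbf{x}_{1:\tau+1}, \mathbf{a}_{1:\tau}) = q(\mathbf{z}_1 \mid \mathbf{x}_1) \prod_{t=1}^{\tau} q(\mathbf{z}_{t+1} \mid \mathbf{x}_{t+1}, \mathbf{z}_t, \mathbf{a}_t) \prod_{t=\tau+1}^{T-1} p(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t) \prod_{t=\tau+1}^{T} \pi(\mathbf{a}_t \mid \mathbf{z}_t). \tag{12}$$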

As discussed by levine2018reinforcement, the agent does not have control over the stochastic dynamics, so we use the dynamics $p(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t)$ for $t \ge \tau+1$ in the variational distribution in order to prevent the agent from choosing optimistic actions.

The joint likelihood is

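$$p(\mathbf{x}_{1:\tau+1}, \mathcal{O}_{\tau+1:T}, \mathbf{z}_{1:T}, \mathbf{a}_{\tau+1:T} \mid \mathbf{a}_{1:\tau}) = p(\mathbf{z}_1) \prod_{t=1}^{T-1} p(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t) \prod_{t=1}^{\tau+1} p(\mathbf{x}_t \mid \mathbf{z}_t) \prod_{t=\tau+1}^{T} p(\mathcal{O}_t \mid \mathbf{z}_t, \mathbf{a}_t) \prod_{t=\tau+1}^{T} p(\mathbf{a}_t). \tag{13}$$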

We use the posterior from Equation (12), the likelihood from Equation (13), and Jensen’s inequality to obtain the ELBO of the marginal likelihood,

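\begin{align}
&\log p(\mathbf{x}_{1:\tau+1}, \mathcal{O}_{\tau+1:T} \mid \mathbf{a}_{1:\tau}) \nonumber \\
&= \log \int_{\mathbf{z}_{1:T}} \int_{\mathbf{a}_{\tau+1:T}} p(\mathbf{x}_{1:\tau+1}, \mathcal{O}_{\tau+1:T}, \mathbf{z}_{1:T}, \mathbf{a}_{\tau+1:T} \mid \mathbf{a}_{1:\tau}) \, d\mathbf{z}_{1:T} \, d\mathbf{a}_{\tau+1:T} \tag{14} \\
&\ge \mathbb{E}_{(\mathbf{z}_{1:T}, \mathbf{a}_{\tau+1:T}) \sim q} \Big[ \log p(\mathbf{x}_{1:\tau+1}, \mathcal{O}_{\tau+1:T}, \mathbf{z}_{1:T}, \mathbf{a}_{\tau+1:T} \mid \mathbf{a}_{1:\tau}) - \log q(\mathbf{z}_{1:T}, \mathbf{a}_{\tau+1:T} \mid \mathbf{x}_{1:\tau+1}, \mathbf{a}_{1:\tau}) \Big] \tag{15} \\
&= \mathbb{E}_{(\mathbf{z}_{1:T}, \mathbf{a}_{\tau+1:T}) \sim q} \Bigg[ \sum_{t=1}^{\tau+1} \log p(\mathbf{x}_t \mid \mathbf{z}_t) - D_{\mathrm{KL}}\big( q(\mathbf{z}_1 \mid \mathbf{x}_1) \,\|\, p(\mathbf{z}_1) \big) - \sum_{t=1}^{\tau} D_{\mathrm{KL}}\big( q(\mathbf{z}_{t+1} \mid \mathbf{x}_{t+1}, \mathbf{z}_t, \mathbf{a}_t) \,\|\, p(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t) \big) \nonumber \\
&\qquad\qquad + \sum_{t=\tau+1}^{T} \big( r(\mathbf{z}_t, \mathbf{a}_t) + \log p(\mathbf{a}_t) - \log \pi(\mathbf{a}_t \mid \mathbf{z}_t) \big) \Bigg]. \tag{16}
\end{align}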

We are interested in the likelihood of optimal trajectories, so we use $\mathcal{O}_t = 1$ for $t \ge \tau+1$, and its distribution is given by $p(\mathcal{O}_t = 1 \mid \mathbf{z}_t, \mathbf{a}_t) = \exp(r(\mathbf{z}_t, \mathbf{a}_t))$ in the control as inference framework. Notice that the dynamics terms $\log p(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t)$ for $t \ge \tau+1$ from the posterior and the prior cancel each other out in the ELBO.
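To spell out these two observations: the future dynamics factors appear identically in the joint likelihood (13) and in the variational distribution (12), so their log-ratio vanishes, and the optimality likelihood contributes the reward directly. Written out as a short worked step (a restatement of the equations above, not an additional result):

```latex
\begin{align*}
\log p(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t)
  - \log p(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t) &= 0,
  & t &= \tau+1, \dots, T-1, \\
\log p(\mathcal{O}_t = 1 \mid \mathbf{z}_t, \mathbf{a}_t)
  &= r(\mathbf{z}_t, \mathbf{a}_t),
  & t &= \tau+1, \dots, T.
\end{align*}
```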

The first part of the ELBO corresponds to the model objective. When using the parametric function approximators, the negative of it corresponds directly to the model loss in Equation (missing).

The second part of the ELBO corresponds to the maximum entropy RL objective. We assume a uniform action prior, so the $\log p(\mathbf{a}_t)$ term is a constant term that can be omitted when optimizing this objective. We use message passing to optimize this objective, with messages defined as

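\begin{align}
Q(\mathbf{z}_t, \mathbf{a}_t) &= r(\mathbf{z}_t, \mathbf{a}_t) + \mathbb{E}_{\mathbf{z}_{t+1} \sim q(\cdot \mid \mathbf{x}_{t+1}, \mathbf{z}_t, \mathbf{a}_t)} \big[ V(\mathbf{z}_{t+1}) \big] \tag{17} \\
V(\mathbf{z}_t) &= \log \int_{\mathbf{a}_t} \exp\big( Q(\mathbf{z}_t, \mathbf{a}_t) \big) \, d\mathbf{a}_t. \tag{18}
\end{align}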

Then, the maximum entropy RL objective can be expressed in terms of the messages as

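\begin{align}
&\mathbb{E}_{(\mathbf{z}_{\tau+1:T}, \mathbf{a}_{\tau+1:T}) \sim q} \Bigg[ \sum_{t=\tau+1}^{T} \big( r(\mathbf{z}_t, \mathbf{a}_t) - \log \pi(\mathbf{a}_t \mid \mathbf{z}_t) \big) \Bigg] \nonumber \\
&= \mathbb{E}_{\mathbf{z}_{\tau+1} \sim q(\cdot \mid \mathbf{x}_{\tau+1}, \mathbf{z}_\tau, \mathbf{a}_\tau)} \Big[ \mathbb{E}_{\mathbf{a}_{\tau+1} \sim \pi(\cdot \mid \mathbf{z}_{\tau+1})} \big[ Q(\mathbf{z}_{\tau+1}, \mathbf{a}_{\tau+1}) - \log \pi(\mathbf{a}_{\tau+1} \mid \mathbf{z}_{\tau+1}) \big] \Big] \tag{19} \\
&= \mathbb{E}_{\mathbf{z}_{\tau+1} \sim q(\cdot \mid \mathbf{x}_{\tau+1}, \mathbf{z}_\tau, \mathbf{a}_\tau)} \Bigg[ -D_{\mathrm{KL}}\Bigg( \pi(\mathbf{a}_{\tau+1} \mid \mathbf{z}_{\tau+1}) \,\Bigg\|\, \frac{\exp\big( Q(\mathbf{z}_{\tau+1}, \mathbf{a}_{\tau+1}) \big)}{\exp\big( V(\mathbf{z}_{\tau+1}) \big)} \Bigg) + V(\mathbf{z}_{\tau+1}) \Bigg], \tag{20}
\end{align}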

where the first equality is obtained from dynamic programming (see levine2018reinforcement for details), the second equality holds from the definition of KL divergence, and $\exp(V(\mathbf{z}_t))$ is the normalization factor for $\exp(Q(\mathbf{z}_t, \mathbf{a}_t))$ with respect to $\mathbf{a}_t$. Since the KL divergence term is minimized when its two arguments represent the same distribution, the optimal policy is given by

$$\pi(\mathbf{a}_t \mid \mathbf{z}_t) = \exp\big( Q(\mathbf{z}_t, \mathbf{a}_t) - V(\mathbf{z}_t) \big). \tag{21}$$

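Noting that the KL divergence term is zero for the optimal policy, the equality between Equation (19) and Equation (20) can be used in Equation (17) to obtain

$$Q(\mathbf{z}_t, \mathbf{a}_t) = r(\mathbf{z}_t, \mathbf{a}_t) + \mathbb{E}_{\mathbf{z}_{t+1} \sim q(\cdot \mid \mathbf{x}_{t+1}, \mathbf{z}_t, \mathbf{a}_t)} \Big[ \mathbb{E}_{\mathbf{a}_{t+1} \sim \pi(\cdot \mid \mathbf{z}_{t+1})} \big[ Q(\mathbf{z}_{t+1}, \mathbf{a}_{t+1}) - \log \pi(\mathbf{a}_{t+1} \mid \mathbf{z}_{t+1}) \big] \Big]. \tag{22}$$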

This equation corresponds to the standard Bellman backup with a soft maximization for the value function.
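The relationship between Equations (18), (21), and (22) can be checked numerically in a simplified setting. The sketch below assumes a small discrete action set, so the integral in Equation (18) becomes a sum; the Q-values are made up for illustration.

```python
import math


def soft_value(q_values):
    """V = log sum_a exp(Q(a)), the discrete analogue of Eq. (18), computed stably."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))


def optimal_policy(q_values):
    """pi(a) = exp(Q(a) - V), as in Eq. (21)."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]


q = [1.0, 2.0, 0.5]  # hypothetical Q-values for three discrete actions
v = soft_value(q)
pi = optimal_policy(q)

# Under the optimal policy, E_pi[Q - log pi] equals V. This is the inner
# expectation of Eq. (22), so the backup reduces to Eq. (17).
inner = sum(p * (qa - math.log(p)) for p, qa in zip(pi, q))
```

The soft value is always at least the largest Q-value, which is why the backup is described as a soft maximization.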

As mentioned in Section 5, our algorithm conditions the parametric policy on the history of observations and actions, which allows us to directly execute the policy without having to perform inference on the latent state at run time. When using the parametric function approximators, the negative of the maximum entropy RL objective, written as in Equation (19), corresponds to the policy loss in Equation (missing). Lastly, the Bellman backup of Equation (22) corresponds to the Bellman residual in Equation (missing) when approximated by a regression objective.

We showed that the SLAC objectives can be derived from applying variational inference in the control as inference framework in the POMDP setting. This leads to the joint likelihood of the past observations and future optimality variables, which we aim to optimize by maximizing the ELBO of the log-likelihood. We decompose the ELBO into the model objective and the maximum entropy RL objective. We express the latter in terms of messages of Q-functions, which in turn are learned by minimizing the Bellman residual. These objectives lead to the model, policy, and critic losses.

Appendix B Network Architectures

Recall that our full sequential latent variable model has two layers of latent variables, which we denote as $\mathbf{z}_t^1$ and $\mathbf{z}_t^2$. We found this design to provide a good balance between ease of training and expressivity, producing good reconstructions and generations and, crucially, providing good representations for reinforcement learning. For reference, we reproduce the model diagram from the main paper in Figure 8. Note that this diagram represents the Bayes net corresponding to our full model. However, since all of the latent variables are stochastic, this visualization also presents the design of the computation graph. Inference over the latent variables is performed using amortized variational inference, with all training done via reparameterization. Hence, the computation graph can be deduced from the diagram by treating all solid arrows as part of the generative model and all dashed arrows as part of the approximate posterior.

Figure 8: Diagram of our full model, reproduced from the main paper. Solid arrows show the generative model, dashed arrows show the inference model. Rewards are not shown for clarity.

The generative model consists of the following probability distributions, as described in the main paper:

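$$\begin{aligned}
\mathbf{z}_1^1 &\sim p(\mathbf{z}_1^1), & \mathbf{z}_{t+1}^1 &\sim p_\psi(\mathbf{z}_{t+1}^1 \mid \mathbf{z}_t^2, \mathbf{a}_t), \\
\mathbf{z}_1^2 &\sim p_\psi(\mathbf{z}_1^2 \mid \mathbf{z}_1^1), & \mathbf{z}_{t+1}^2 &\sim p_\psi(\mathbf{z}_{t+1}^2 \mid \mathbf{z}_{t+1}^1, \mathbf{z}_t^2, \mathbf{a}_t), \\
\mathbf{x}_t &\sim p_\psi(\mathbf{x}_t \mid \mathbf{z}_t^1, \mathbf{z}_t^2), & r_t &\sim p_\psi(r_t \mid \mathbf{z}_t^1, \mathbf{z}_t^2, \mathbf{a}_t, \mathbf{z}_{t+1}^1, \mathbf{z}_{t+1}^2).
\end{aligned}$$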

The initial distribution $p(\mathbf{z}_1^1)$ is a multivariate standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. All of the other distributions are conditional and parameterized by neural networks with parameters $\psi$. The networks for $p_\psi(\mathbf{z}_1^2 \mid \mathbf{z}_1^1)$, $p_\psi(\mathbf{z}_{t+1}^1 \mid \mathbf{z}_t^2, \mathbf{a}_t)$, $p_\psi(\mathbf{z}_{t+1}^2 \mid \mathbf{z}_{t+1}^1, \mathbf{z}_t^2, \mathbf{a}_t)$, and $p_\psi(r_t \mid \mathbf{z}_t^1, \mathbf{z}_t^2, \mathbf{a}_t, \mathbf{z}_{t+1}^1, \mathbf{z}_{t+1}^2)$ consist of two fully connected layers, each with 256 hidden units, and a Gaussian output layer. The Gaussian layer is defined such that it outputs a multivariate normal distribution with diagonal variance, where the mean is the output of a linear layer and the diagonal standard deviation is the output of a fully connected layer with softplus non-linearity. The observation model $p_\psi(\mathbf{x}_t \mid \mathbf{z}_t^1, \mathbf{z}_t^2)$ consists of 5 transposed convolutional layers (256 4×4, 128 3×3, 64 3×3, 32 3×3, and 3 5×5 filters, respectively, stride 2 each, except for the first layer). The output variance for each image pixel is fixed to 0.1.
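The Gaussian output layer described above can be sketched in a few lines. This is an illustrative plain-Python version with tiny hand-picked weights; the function names, shapes, and weight values are assumptions made for the example and do not come from the paper's implementation.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps the standard deviation positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def linear(weights, bias, h):
    """Dense layer: one output per weight row."""
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(weights, bias)]


def gaussian_layer(h, w_mean, b_mean, w_std, b_std):
    """Mean from a linear layer, diagonal std from a dense layer with softplus."""
    mean = linear(w_mean, b_mean, h)
    std = [softplus(s) for s in linear(w_std, b_std, h)]
    return mean, std


def sample(mean, std, rng):
    """Reparameterized sample: mean + std * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


# Toy usage: a 3-dimensional hidden vector mapped to a 2-dimensional Gaussian.
h = [0.2, -1.0, 0.5]
w_mean, b_mean = [[0.1, 0.0, -0.3], [0.5, 0.2, 0.0]], [0.0, 0.1]
w_std, b_std = [[0.0, 1.0, 0.0], [-2.0, 0.0, 1.0]], [0.0, -5.0]
mean, std = gaussian_layer(h, w_mean, b_mean, w_std, b_std)
z = sample(mean, std, random.Random(0))
```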

The variational distribution q, also referred to as the inference model or the posterior, is represented by the following factorization:

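$$\begin{aligned}
\mathbf{z}_1^1 &\sim q_\psi(\mathbf{z}_1^1 \mid \mathbf{x}_1), & \mathbf{z}_{t+1}^1 &\sim q_\psi(\mathbf{z}_{t+1}^1 \mid \mathbf{x}_{t+1}, \mathbf{z}_t^2, \mathbf{a}_t), \\
\mathbf{z}_1^2 &\sim p_\psi(\mathbf{z}_1^2 \mid \mathbf{z}_1^1), & \mathbf{z}_{t+1}^2 &\sim p_\psi(\mathbf{z}_{t+1}^2 \mid \mathbf{z}_{t+1}^1, \mathbf{z}_t^2, \mathbf{a}_t).
\end{aligned}$$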

Note that the variational distribution over $\mathbf{z}_1^2$ and $\mathbf{z}_{t+1}^2$ is intentionally chosen to exactly match the generative model $p$, such that this term does not appear in the KL-divergence within the ELBO, and a separate variational distribution is only learned over $\mathbf{z}_1^1$ and $\mathbf{z}_{t+1}^1$. This intentional design decision simplifies the inference process. The networks representing the distributions $q_\psi(\mathbf{z}_1^1 \mid \mathbf{x}_1)$ and $q_\psi(\mathbf{z}_{t+1}^1 \mid \mathbf{x}_{t+1}, \mathbf{z}_t^2, \mathbf{a}_t)$ both consist of 5 convolutional layers (32 5×5, 64 3×3, 128 3×3, 256 3×3, and 256 4×4 filters, respectively, stride 2 each, except for the last layer), 2 fully connected layers (256 units each), and a Gaussian output layer. The parameters of the convolution layers are shared among both distributions.

The latent variables have 32 and 256 dimensions, respectively, i.e. $\mathbf{z}_t^1 \in \mathbb{R}^{32}$ and $\mathbf{z}_t^2 \in \mathbb{R}^{256}$. For the image observations, $\mathbf{x}_t \in [0,1]^{64 \times 64 \times 3}$. All the layers, except for the output layers, use leaky ReLU non-linearities. Note that there are no deterministic recurrent connections in the network: all networks are feedforward, and the temporal dependencies all flow through the stochastic units $\mathbf{z}_t^1$ and $\mathbf{z}_t^2$.
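As a sanity check on the layer lists above, the spatial sizes can be traced through the encoder and decoder. The filter counts, kernel sizes, and strides below are from the text; the padding modes are an assumption (the text does not state them), chosen here as "same" for the stride-2 layers and "valid" for the single stride-1 4×4 layer, which maps a 64×64 image to a 1×1 feature map and back.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid"


def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution along one spatial dimension."""
    if padding == "same":
        return size * stride
    return (size - 1) * stride + kernel  # "valid"


# (filters, kernel, stride, padding); the padding column is assumed, not sourced.
ENCODER = [(32, 5, 2, "same"), (64, 3, 2, "same"), (128, 3, 2, "same"),
           (256, 3, 2, "same"), (256, 4, 1, "valid")]
DECODER = [(256, 4, 1, "valid"), (128, 3, 2, "same"), (64, 3, 2, "same"),
           (32, 3, 2, "same"), (3, 5, 2, "same")]


def trace(size, layers, step):
    """Apply each layer's size rule in turn and record every intermediate size."""
    sizes = [size]
    for _filters, kernel, stride, padding in layers:
        size = step(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


encoder_sizes = trace(64, ENCODER, conv_out)
decoder_sizes = trace(1, DECODER, deconv_out)
```

Under these assumed paddings the decoder is the exact mirror of the encoder, ending at the 64×64 resolution with 3 output channels.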

For the reinforcement learning process, we use a critic network $Q_\theta$ consisting of 2 fully connected layers (256 units each) and a linear output layer. The actor network $\pi_\phi$ consists of 5 convolutional layers, 2 fully connected layers (256 units each), a Gaussian layer, and a tanh bijector, which constrains the actions to be in the bounded action space of $[-1, 1]$. The convolutional layers are the same as the ones from the latent variable model, but the parameters of these layers are not updated by the actor objective. The exact same network architecture is used for every one of the experiments in the paper.
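A tanh bijector on top of a Gaussian layer changes the action density, so the log-probability needs a change-of-variables correction. The sketch below shows the standard correction for a one-dimensional action; it is an assumption that the actor follows this common construction, since the text only names the Gaussian layer and the tanh bijector. The helper names are illustrative.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_tanh_jacobian(u):
    """log(1 - tanh(u)^2), written in a form that stays finite for large |u|."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def sample_squashed_action(mean, std, rng):
    """Sample u ~ N(mean, std^2); return a = tanh(u) and log pi(a)."""
    u = mean + std * rng.gauss(0.0, 1.0)
    a = math.tanh(u)
    log_prob_u = (-0.5 * ((u - mean) / std) ** 2
                  - math.log(std) - 0.5 * math.log(2.0 * math.pi))
    # Change of variables: log pi(a) = log N(u) - log |d tanh(u) / du|.
    return a, log_prob_u - log_tanh_jacobian(u)


rng = random.Random(0)
draws = [sample_squashed_action(0.3, 0.5, rng) for _ in range(100)]
```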

Appendix C Training and Evaluation Details

The control portion of our algorithm uses the same hyperparameters as SAC (haarnoja2018soft), except for a smaller replay buffer size of 100000 environment steps (instead of a million) due to the high memory usage of image observations. All of the parameters are trained with the Adam optimizer (kingma2015adam), and we perform one gradient step per environment step. The Q-function and policy parameters are trained with a learning rate of 0.0003 and a batch size of 256. The model parameters are trained with a learning rate of 0.0001 and a batch size of 32. We use sequences of length τ=8 for all the tasks. Note that the sequence length can be less than τ for the first t steps (t<τ) of each episode.
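For quick reference, the training settings stated above can be collected in a single configuration object. The class and field names are hypothetical and chosen for this sketch; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlacTrainingConfig:
    """Hypothetical container for the hyperparameters listed in Appendix C."""
    replay_buffer_size: int = 100_000      # environment steps
    gradient_steps_per_env_step: int = 1
    optimizer: str = "adam"
    rl_learning_rate: float = 3e-4         # Q-function and policy parameters
    rl_batch_size: int = 256
    model_learning_rate: float = 1e-4      # latent variable model parameters
    model_batch_size: int = 32
    sequence_length: int = 8               # tau


config = SlacTrainingConfig()
```

The remaining control hyperparameters follow SAC (haarnoja2018soft) and are not restated here.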

We use action repeats for all the methods, except for D4PG for which we use the reported results from prior work (tassa2018deepmind). The number of environment steps reported in our plots corresponds to the unmodified steps of the benchmarks. Note that the methods that use action repeats only use a fraction of the environment steps reported in our plots. For example, 3 million environment steps of the cheetah task correspond to 750000 samples when using an action repeat of 4. The action repeats used in our experiments are given in Table 1.

Unlike in prior work (haarnoja2018soft; haarnoja2018applications), we use the same stochastic policy as both the behavioral and evaluation policy since we found the deterministic greedy policy to be comparable to or worse than the stochastic policy.

| Benchmark | Task | Action repeat | Original control time step | Effective control time step |
| --- | --- | --- | --- | --- |
| DeepMind Control Suite | cheetah run | 4 | 0.01 | 0.04 |
| DeepMind Control Suite | walker walk | 2 | 0.025 | 0.05 |
| DeepMind Control Suite | ball-in-cup catch | 4 | 0.02 | 0.08 |
| DeepMind Control Suite | finger spin | 2 | 0.02 | 0.04 |
| OpenAI Gym | HalfCheetah-v2 | 1 | 0.05 | 0.05 |
| OpenAI Gym | Walker2d-v2 | 4 | 0.008 | 0.032 |
| OpenAI Gym | Hopper-v2 | 2 | 0.008 | 0.016 |
| OpenAI Gym | Ant-v2 | 4 | 0.05 | 0.2 |

Table 1: Action repeats and the corresponding agent’s control time step used in our experiments.
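The two quantities derived from the action repeat are simple products and quotients. The sketch below reproduces the effective control time steps of Table 1 and the 750000-sample example from the text; the dictionary and function names are illustrative.

```python
# (action repeat, original control time step), copied from Table 1.
ACTION_REPEATS = {
    "cheetah run": (4, 0.01),
    "walker walk": (2, 0.025),
    "ball-in-cup catch": (4, 0.02),
    "finger spin": (2, 0.02),
    "HalfCheetah-v2": (1, 0.05),
    "Walker2d-v2": (4, 0.008),
    "Hopper-v2": (2, 0.008),
    "Ant-v2": (4, 0.05),
}


def effective_time_step(task):
    """Time between agent decisions: action repeat times the original time step."""
    repeat, dt = ACTION_REPEATS[task]
    return repeat * dt


def agent_samples(env_steps, task):
    """Number of agent decisions contained in a given number of environment steps."""
    repeat, _ = ACTION_REPEATS[task]
    return env_steps // repeat


cheetah_samples = agent_samples(3_000_000, "cheetah run")
ant_step = effective_time_step("Ant-v2")
```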

Appendix D Additional Experiments on Simulated Robotic Manipulation Tasks

Beyond standard benchmark tasks, we also aim to illustrate the flexibility of our method by demonstrating it on a variety of image-based robotic manipulation skills. The reward functions for these tasks are the following:

Sawyer Door Open

$$r = -\left| \theta_{\text{door}} - \theta_{\text{desired}} \right| \tag{23}$$

Sawyer Drawer Close

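\begin{align}
d_{\text{handToHandle}} &= \left\| xyz_{\text{hand}} - xyz_{\text{handle}} \right\|_2^2 \nonumber \\
d_{\text{drawerToGoal}} &= \left| x_{\text{drawer}} - x_{\text{goal}} \right| \nonumber \\
r &= -d_{\text{handToHandle}} - d_{\text{drawerToGoal}} \tag{24}
\end{align}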

Sawyer Pick-up

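\begin{align}
d_{\text{toObj}} &= \left\| xyz_{\text{hand}} - xyz_{\text{obj}} \right\|_2^2 \nonumber \\
r_{\text{reach}} &= 0.25 \left( 1 - \tanh(10.0 \, d_{\text{toObj}}) \right) \nonumber \\
r_{\text{lift}} &= \begin{cases} 1 & \text{if object is lifted} \\ 0 & \text{else} \end{cases} \nonumber \\
r &= \max(r_{\text{reach}}, r_{\text{lift}}) \tag{25}
\end{align}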
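The three reward functions translate directly into code. This sketch reads the distance terms as squared Euclidean norms, which is an interpretation of the notation in Equations (24) and (25); the function and argument names are illustrative and not taken from the task suite.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two 3-D points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def door_open_reward(theta_door, theta_desired):
    """Eq. (23): negative absolute angle error."""
    return -abs(theta_door - theta_desired)


def drawer_close_reward(xyz_hand, xyz_handle, x_drawer, x_goal):
    """Eq. (24): penalize hand-to-handle and drawer-to-goal distances."""
    return -squared_distance(xyz_hand, xyz_handle) - abs(x_drawer - x_goal)


def pick_up_reward(xyz_hand, xyz_obj, object_is_lifted):
    """Eq. (25): shaped reaching reward, replaced by 1 once the object is lifted."""
    r_reach = 0.25 * (1.0 - math.tanh(10.0 * squared_distance(xyz_hand, xyz_obj)))
    r_lift = 1.0 if object_is_lifted else 0.0
    return max(r_reach, r_lift)
```

Since the reaching term is at most 0.25, the lifting term dominates the maximum whenever the object is lifted.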

In Figure 9, we illustrate SLAC’s execution of these manipulation tasks, in which a simulated Sawyer robotic arm pushes open a door, closes a drawer, and reaches out and picks up an object. Our method is able to learn these contact-rich manipulation tasks from raw images, succeeding even when the object of interest actually occupies only a small portion of the image.

Figure 9: Qualitative results of SLAC learning to perform manipulation tasks such as opening a door, closing a drawer, and picking up a block with the Sawyer robot. SLAC is able to learn these contact-rich manipulation tasks from raw images, succeeding even when the object of interest occupies only a small portion of the image.

In our next set of manipulation experiments, we use the 9-DoF 3-fingered DClaw robot to rotate a valve (zhu2019dexterous) from various starting positions to various desired goal locations, where the goal is illustrated as a green dot in the image. In all of our experiments, we allow the starting position of the valve to be selected randomly from $[-\pi, \pi]$, and we test three different settings for the goal location (see Figure 10). First, we prescribe the goal position to always be fixed. In this task setting, we see that SLAC, SAC from images, and SAC from state all perform similarly in terms of both sample efficiency and final performance. However, when we allow the goal location to be selected randomly from a set of 3 options $\{-\frac{\pi}{2}, 0, \frac{\pi}{2}\}$, we see that SLAC and SAC from images actually outperform SAC from state. This interesting result can perhaps be explained by the fact that, when learning from state, the goal is specified with just a single number within the state vector, rather than with the redundancy of numerous green pixels in the image. Finally, when we allow the goal position of the valve to be selected randomly from $[-\frac{\pi}{2}, \frac{\pi}{2}]$, we see that SLAC’s explicit representation learning improves substantially over image-based SAC, performing comparably to the oracle baseline that receives the true state observation.

Figure 10: Experiments on the DClaw task of turning a valve to a desired location, as shown by a green dot, including comparisons for achieving a (a) fixed goal, (b) three possible goals, and (c) random goals.

Appendix E Additional Predictions from the Latent Variable Model

We show additional samples from our model in Figure 11, Figure 12, and Figure 13. Samples from the posterior show the images $\mathbf{x}_t$ as constructed by the decoder $p_\psi(\mathbf{x}_t \mid \mathbf{z}_t)$, using a sequence of latents $\mathbf{z}_t$ that are encoded and sampled from the posteriors, $q_\psi(\mathbf{z}_1 \mid \mathbf{x}_1)$ and $q_\psi(\mathbf{z}_{t+1} \mid \mathbf{x}_{t+1}, \mathbf{z}_t, \mathbf{a}_t)$. Samples from the prior, on the other hand, use a sequence of latents where $\mathbf{z}_1$ is sampled from $p(\mathbf{z}_1)$ and all remaining latents $\mathbf{z}_t$ are from the propagation of the previous latent state through the latent dynamics $p_\psi(\mathbf{z}_{t+1} \mid \mathbf{z}_t, \mathbf{a}_t)$. These samples do not use any image frames as inputs, and thus they do not correspond to any ground truth sequence. We also show samples from the conditional prior, which is conditioned on the first image from the true sequence: for this, the sampling procedure is the same as the prior, except that $\mathbf{z}_1$ is encoded and sampled from the posterior $q_\psi(\mathbf{z}_1 \mid \mathbf{x}_1)$, rather than being sampled from $p(\mathbf{z}_1)$.

[Image panels: Walker walk, Ball-in-cup catch, Finger spin. Rows in each panel, top to bottom: truth, posterior sample, conditional prior sample, prior sample.]
Figure 11: Example image sequences, along with generated image samples, for three of the DM Control tasks that we used in our experiments. See Figure 7 for more details and for image samples from the cheetah task.
[Image panels: HalfCheetah-v2, Walker2d-v2, Hopper-v2, Ant-v2. Rows in each panel, top to bottom: truth, posterior sample, conditional prior sample, prior sample.]
Figure 12: Example image sequences, along with generated image samples, for the four OpenAI Gym tasks that we used in our experiments.
[Image panels: Sawyer Door Open, Sawyer Drawer Close, Sawyer Pick-up. Rows in each panel, top to bottom: truth, posterior sample, conditional prior sample, prior sample.]
Figure 13: Example image sequences, along with generated image samples, for the manipulation tasks that we used in our experiments.