Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

  • 2019-05-31 15:30:20
  • Vitchyr H. Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, Sergey Levine
  • 0

Abstract

In standard reinforcement learning, each new skill requires amanually-designed reward function, which takes considerable manual effort andengineering. Self-supervised goal setting has the potential to automate thisprocess, enabling an agent to propose its own goals and acquire skills thatachieve these goals. However, such methods typically rely on manually-designedgoal distributions, or heuristics to force the agent to explore a wide range ofstates. We propose a formal exploration objective for goal-reaching policiesthat maximizes state coverage. We show that this objective is equivalent tomaximizing the entropy of the goal distribution together with goal reachingperformance, where goals correspond to entire states. We present an algorithmcalled Skew-Fit for learning such a maximum-entropy goal distribution, and showthat under certain regularity conditions, our method converges to a uniformdistribution over the set of possible states, even when we do not know this setbeforehand. Skew-Fit enables self-supervised agents to autonomously choose andpractice diverse goals. Our experiments show that it can learn a variety ofmanipulation tasks from images, including opening a door with a real robot,entirely from scratch and without any manually-designed reward function.

 

Quick Read (beta)

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

Vitchyr H. Pong, Murtaza Dalalfootnotemark: , Steven Linfootnotemark: , Ashvin Nair, Shikhar Bahl, Sergey Levine
University of California, Berkeley
{vitchyr,mdalal,stevenlin598,anair17,shikharbahl,svlevine}@berkeley.edu
Equal contribution
Abstract

In standard reinforcement learning, each new skill requires a manually-designed reward function, which takes considerable manual effort and engineering. Self-supervised goal setting has the potential to automate this process, enabling an agent to propose its own goals and acquire skills that achieve these goals. However, such methods typically rely on manually-designed goal distributions, or heuristics to force the agent to explore a wide range of states. We propose a formal exploration objective for goal-reaching policies that maximizes state coverage. We show that this objective is equivalent to maximizing the entropy of the goal distribution together with goal reaching performance, where goals correspond to entire states. We present an algorithm called Skew-Fit for learning such a maximum-entropy goal distribution, and show that under certain regularity conditions, our method converges to a uniform distribution over the set of valid states, even when we do not know this set beforehand. Skew-Fit enables self-supervised agents to autonomously choose and practice reaching diverse goals. Our experiments show that it can learn a variety of manipulation tasks from images, including opening a door with a real robot, entirely from scratch and without any manually-designed reward function.

 

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning


  Vitchyr H. Pongthanks: Equal contribution, Murtaza Dalalfootnotemark: , Steven Linfootnotemark: , Ashvin Nair, Shikhar Bahl, Sergey Levine University of California, Berkeley {vitchyr,mdalal,stevenlin598,anair17,shikharbahl,svlevine}@berkeley.edu

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

(a) Skew-Fit
(b) MLE
Figure 1: Left: The robot must manipulate a hook to open a door. Right: Samples from a generative model when trained with (a) Skew-Fit and (b) maximum likelihood estimation (MLE). When used as goals, the diverse samples from Skew-Fit encourage the robot to practice opening the door more frequently.

Reinforcement learning (RL) provides an appealing formalism for automated learning of behavioral skills. However, if we want agents to learn large repertoires of behaviors, this framework proves inadequate: separately learning every potentially useful skill becomes prohibitively time consuming, both in terms of the experience required for the agent, and the effort required for the user to design reward functions for each behavior. What if we could instead design an unsupervised RL algorithm that automatically explores the environment and iteratively distills this experience into general-purpose policies that can accomplish new user-specified tasks at test time?

For an agent to learn autonomously, it needs an exploration objective. An effective exploration scheme is one that visits as many states as possible, allowing a policy to autonomously prepare for any possible user-specified task that it might see at test time. We can formalize this objective as maximizing the entropy of the learned policy’s stat distribution (𝐒), since a policy that maximizes this objective should approach a uniform distribution over possible legal states. However, a short-coming of this objective is that the resulting policy cannot be used to solve new tasks: it only knows how to maximize state entropy. Thus, if we want to develop principled unsupervised RL algorithms that result in useful policies, maximizing (𝐒) is not enough. We need a mechanism that allows us to control the resulting policy to achieve new goals at test-time.

We argue that this can be accomplished by performing goal-directed exploration. In addition to maximizing the state distribution entropy (𝐒), we should be able to control where the policy goes by giving it a goal 𝐆 that corresponds to a desired state. Mathematically, this can be accomplished by stating that a goal-conditioned policy should also minimize the conditional entropy over the states given a goal, (𝐒𝐆). This objective provides us with a principled way for training a policy to explore all states (“maximize (𝐒)”) such that the state that is reached can be controlled by commanding goals (“minimize (𝐒𝐆)”).

Directly using this objective is often intractable, since it requires optimizing the marginal state distribution of the policy. However, we can avoid this difficult optimization by noting that our objective is the mutual information between the state and the goal, I(𝐒;𝐆), which can be written as:

(𝐒)-(𝐒|𝐆)=I(𝐒;𝐆)=(𝐆)-(𝐆|𝐒). (1)

Equation 1 thus gives an equivalent objective for an unsupervised RL algorithm: the agent should set diverse goals for itself (“maximize (𝐆)”) and learn how to reach them (“minimize (𝐆𝐒)”).

While the second term is the typical objective studied in goal-conditioned RL (Kaelbling, 1993; Andrychowicz et al., 2017), maximizing the diversity of goals is crucial for effectively learning to reach all possible states. In a new environment, acquiring such a maximum-entropy goal distribution is challenging: how can an agent set diverse goals when it does not even know what states exist?

In this paper, we answer this question by presenting Skew-Fit, a method for learning to model the uniform distribution over states, given only access to data collected by an autonomous goal-conditioned policy. Our paper makes the following contributions. First, we propose a principled objective for unsupervised RL given by Equation 1. While a number of prior works ignore the (𝐆) term, we argue that jointly optimizing the entire quantity is needed to develop effective and useful exploration. Second, we propose Skew-Fit, a method that, under regularity conditions, provably maximizes (𝐆) even when the underlying state space is unknown. Third, we empirically demonstrate that, when combined with goal-conditioned RL, Skew-Fit allows us to autonomously train goal-conditioned policies that reach diverse states. We test this method on a variety of simulated vision-based robot tasks without any task-specific reward function. In these experiments, Skew-Fit reaches substantially better final performance than prior methods, and learns much more quickly. We also demonstrate that our approach solves a real-world manipulation task, which requires a robot to learn to open a door, from scratch in about five hours, without any manually-designed reward function, directly from images.

2 Related Work

Many prior methods for training goal-conditioned policies assume that a goal distribution is available to sample from during exploration (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018). Other methods use data collected from a randomly initialized policy or heuristics based on data collected online to design a non-parametric (Colas et al., 2018b; Warde-Farley et al., 2018; Florensa et al., 2018a) or parametric (Péré et al., 2018; Nair et al., 2018) goal distribution. We remark that Warde-Farley et al. (2018) also motivate their work in terms of minimizing a lower bound for (𝐆𝐒). Our work is complementary to these goal-reaching methods: rather than focusing on how to train goal-reaching policies, we propose a principled method for maximizing the entropy of a goal sampling distribution, (𝐆).

Our method learns without any task rewards, directly acquiring a policy that can be reused to reach user-specified goals. This stands in contrast to exploration methods that give bonus rewards based on state visitation frequency (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017; Savinov et al., 2018; Chentanez et al., 2005; Lopes et al., 2012; Stadie et al., 2016; Pathak et al., 2017; Burda et al., 2018, 2019; Mohamed & Rezende, 2015; Chentanez et al., 2005). While these methods can also be used without a task reward, they provide no mechanism for distilling the knowledge gained from visiting diverse states into flexible policies that can be applied to accomplish new goals at test-time: their policies visit novel states, and they quickly forget about them as other states become more novel.

Other prior methods extract reusable skills in the form of latent-variable-conditioned policies, where latent variables can be interpreted as options (Sutton et al., 1999) or abstract skills (Hausman et al., 2018; Gupta et al., 2018b; Eysenbach et al., 2019; Gupta et al., 2018a; Florensa et al., 2017). The resulting skills may be diverse, but they have no grounded interpretation, while our method can be used immediately after unsupervised training to reach diverse user-specified goals.

Some prior methods propose to choose goals based on heuristics such as learning progress (Baranes & Oudeyer, 2012; Veeriah et al., 2018; Colas et al., 2018a), how off-policy the goal is (Nachum et al., 2018), level of difficulty (Florensa et al., 2018b) or likelihood ranking (Zhao & Tresp, 2019). In contrast, our approach provides a principled framework for optimizing a concrete and well-motivated exploration objective, and can be shown to maximize this objective under regularity assumptions. Hazan et al. (2018) also provably optimizes a well-motivated exploration objective, but is limited to tabular MDPs, while ours is able to handle high dimensional settings such as vision-based continuous control. In order to maximize state entropy, our method employs a reweighting scheme when fitting the goal proposal model. Concurrent work has proposed a related prioritization based on ranking (Zhao & Tresp, 2019) but does not consider the unsupervised image-based reinforcement learning setting. We compare against this method in Section D.3 of the Appendix and find that our method significantly outperforms it.

3 Problem Formulation

To ensure that an unsupervised reinforcement learning agent learns to reach all possible states in a controllable way, we maximize the mutual information between the state 𝐒 and the goal 𝐆, I(𝐒;𝐆). As shown in Equation 1, this objective can be decomposed into two separate terms: minimizing (𝐆𝐒) and maximizing (𝐆).

3.1 Minimizing (𝐆𝐒): Goal-Conditioned RL

Standard RL considers a Markov decision process (MDP), which has a state space 𝒮, action space 𝒜, and unknown dynamics ρ(𝐬t+1𝐬t,𝐚t):𝒮×𝒮×𝒜[0,+). Goal-conditioned RL also includes a goal space 𝒢, which we assume to be the same as the state space, 𝒢=𝒮. 11 1 We note that goal-conditioned RL can always be formulated as a standard RL problem by appending the goal to the state. Additionally, some authors define the goal as a feature of the state. It is straightforward to apply the analysis and method presented in this paper to this setting. A goal-conditioned policy π(𝐚𝐬,𝐠) maps a state 𝐬𝒮 and goal 𝐠𝒮 to a distribution over actions 𝐚𝒜, and its objective is to reach the goal, i.e., to make the current state equal to the goal.

Goal-reaching can be formulated as minimizing (𝐆𝐒), and many practical goal-reaching algorithms (Kaelbling, 1993; Lillicrap et al., 2016; Schaul et al., 2015; Andrychowicz et al., 2017; Nair et al., 2018; Pong et al., 2018; Florensa et al., 2018a) can be viewed as approximations of this. See Appendix A for more details. Our method may thus be used in conjunction with any of these prior goal-conditioned RL methods in order to jointly minimize (𝐆𝐒) and maximize (𝐆).

3.2 Maximizing (𝐆): Setting Diverse Goals

We now turn to the problem of setting diverse goals or, mathematically, maximizing the entropy of the goal distribution (𝐆). Let U𝒮 be the uniform distribution over 𝒮, where we assume 𝒮 has finite volume so that the uniform distribution is well-defined. Let pϕ be the goal distribution, i.e. 𝐆pϕ. Our goal is to maximize the entropy of pϕ, which we write as (𝐆). Since the maximum entropy distribution over 𝒮 is the uniform distribution U𝒮, maximizing (𝐆) may seem as simple as choosing the uniform distribution to be our goal distribution: pϕ=U𝒮. However, this requires knowing the uniform distribution over valid states, which may be difficult to obtain when 𝒮 is a strict subset of n, for some n. For example, if the states correspond to images viewed through a robot’s camera, 𝒮 corresponds to the (unknown) set of valid images of the robot’s environment, while n corresponds to all possible arrays of pixel values of a particular size. In such environments, sampling from the uniform distribution n is unlikely to correspond to a valid image of the real world. Sampling uniformly from 𝒮 would require knowing the set of all possible valid images, which we assume the agent does not know when starting to explore the environment.

While we cannot sample arbitrary states from 𝒮, we can sample states by performing goal-directed exploration. To derive and analyze our method, we introduce a simple model of this process: a goal 𝐆pϕ is sampled from the goal distribution pϕ, and then the agent attempts to achieve this goal, which results in a distribution of states 𝐒𝒮 seen along the trajectory. We abstract this entire process by writing the resulting marginal distribution over 𝐒 as p(𝐒pϕ). We assume that p(𝐒pϕ) has full support, which can be accomplished with an epsilon-greedy goal reaching policy in a communicating MDP. We also assume we have access to an oracle goal reacher, so that p(𝐒pϕ)pϕ(𝐒). This simplified model allows us to analyze the behavior of our goal-setting scheme in isolation of any goal-reaching algorithm, but we show in Section 6 that we can jointly learn a goal-reaching policy.

In summary, we want to acquire a maximum-entropy goal distribution pϕ over valid states 𝒮, while only having access to state samples from p(𝐒pϕ).

4 Skew-Fit: Learning a Maximum Entropy Goal Distribution

Our method, Skew-Fit, learns a maximum entropy goal distribution pϕ using samples collected from a goal-conditioned policy. We analyze the algorithm and show that Skew-Fit maximizes the entropy of the goal distribution, and present a practical instantiation for unsupervised deep RL.

4.1 Skew-Fit Algorithm

To learn a uniform distribution over valid goal states, we present a method that iteratively increases the entropy of a generative model pϕ. In particular, given a generative model pϕt at iteration t, we would like to train a new generative model pϕt+1 such that pϕt+1 has higher entropy over the set of valid states. While we do not know the set of valid states 𝒮, we can sample states from p(𝐒pϕt), resulting in an empirical distribution pempt over the states

pempt(𝐬)=1Nn=1N𝟏{𝐬=𝐒n},𝐒np(𝐒pϕt), (2)

and use this empirical distribution to train the next generative model pϕt+1. However, if we simply train pϕt+1 to model this empirical distribution, it may not necessarily have higher entropy than pϕt.

The intuition behind our method is quite simple: rather than fitting a generative model to our empirical distribution, we skew the empirical distribution so that rarely visited states are given more weight. See Figure 2 for a visualization of this process.

Figure 2: Our method, Skew-Fit, samples goals for goal-conditioned RL in order to induce a uniform state visitation distribution. We start by sampling from our replay buffer, and weighting the states such that rare states are given more weight. We then train a generative model pϕt+1 with the weighted samples. By sampling new states with goals proposed from this new generative model, we obtain a higher entropy distribution of states in our replay buffer at the next iteration.

How should we skew the empirical distribution if we want to maximize the entropy of pϕt+1? If we had access to the density of each state, pempt(𝐒), then we could simply weight each state by 1/pempt(𝐒). We could then perform maximum likelihood estimation (MLE) for the uniform distribution by using the following loss to train ϕt+1:

(ϕ) =𝔼𝐒U𝒮[logpϕ(𝐒)]=𝔼𝐒pempt[U𝒮(𝐒)pempt(𝐒)logpϕ(𝐒)]𝔼𝐒pempt[1pempt(𝐒)logpϕ(𝐒)]

where we use the fact that the uniform distribution U𝒮(𝐒) has constant density for all states in 𝒮. However, computing this density pempt(𝐒) requires marginalizing out the MDP dynamics, which requires an accurate model of both the dynamics and the goal-conditioned policy. We avoid needing to model the entire MDP process by approximating pempt(𝐒) with our previous learned generative model: pempt(𝐒)p(𝐒pϕt)pϕt(𝐒).

Variance minimization.

This procedure relies on importance sampling (IS), which can have high variance, particularly if pϕt(𝐒)0. We reduce this variance by weighing each state by pϕt(𝐒)α, for α[-1,0) rather than pϕ(𝐒)-1. If α=-1, then we recover the exact importance sampling procedure described above. If α=0, then this skew step has no effect. By choosing intermediate values of α, we can trade off the variance introduced from dividing the weight by a potentially small pϕt(𝐒) with the speed at which we want to increase the entropy of the goal distribution. Moreover, we show in Section 4.2 that, for a range of values of α[-1,0), this procedure will converge to the uniform distribution over valid states. Therefore, we explicitly define a skewed distribution using a weight with exponent α:

pskewedt(𝐬) =1Zαpempt(𝐬)pϕt(𝐬)α,α[-1,0),Zα=n=1Npempt(𝐒n)pϕt(𝐒n)α, (3)

where Zα is the normalizing coefficient and pempt is given by Equation 2. Lastly, we fit the generative model at the next iteration pϕt+1 to pskewedt using standard MLE. We note that computing Zα does not add much computational overhead, since all importance weights already need to be computed.

Goal sampling alternative.

Because pϕt+1pskewedt, at iteration t+1, one can sample goals from either pϕt+1 or pskewedt. Sampling goals from pskewedt may be preferred if sampling from the learned generative model pϕt+1 is computationally or otherwise challenging. In either case, one still needs to train the generative model pϕt to create pskewedt. In our experiments, we found that both methods perform well.

Overall, Skew-Fit samples data from the environment and weights different samples by their density under the generative model pϕt. We prove in the next section that this weighting makes the generative model at the next iteration pϕt+1 have higher entropy. By doing so, the generative model is more likely to generate goals at the frontier of unseen states, which results in more uniform state coverage. This procedure is shown in Figure 2 and Skew-Fit is summarized in Algorithm 4.1.

{algorithm}

Skew-Fit {algorithmic}[1] \FORIteration t=1,2, \STATECollect N states {𝐒i}i=1N by sampling goals from pϕt (or pskewedt) and running goal-conditioned policy. \STATEConstruct skewed distribution pskewedt (Equation 3). \STATEFit pϕt+1 to skewed distribution pskewedt using MLE. \ENDFOR

4.2 Skew-Fit Analysis

In this section, we provide conditions under which pskewedt converges to the uniform distribution over the state space 𝒮. Our most general result is stated as follows:

Lemma 4.1.

Let S be a compact set. Define the set of distributions Q={p: support of p is S}. Let F:QQ be a continuous function and such that H(F(p))H(p) with equality if and only if p is the uniform probability distribution on S, US. Define the sequence of distributions P=(p1,p2,) by starting with any p1Q and recursively defining pt+1=F(pt).

The sequence P converges to US.

Proof.

See Appendix Section A. ∎

We will apply Lemma 4.1 to be the map from pskewedt to pskewedt+1 to show that pskewedt converges to U𝒮22 2 We take N and refer to convergence in distribution.. If we assume that the goal-conditioned policy and generative model learning procedure are well behaved (technical assumptions are stated in B.1 of the Appendix), then to apply Lemma 4.1, we only need to show that (pskewedt)(pempt) with equality if and only if pempt=U𝒮. For the simple case when pϕt=pempt identically at each iteration, we prove that this is true for any value of α[-1,0) in Lemma B.3 of the Appendix. However, in practice, pϕt only approximates pempt. To address this more realistic situation, we prove the following result:

Lemma 4.2.

Given two distribution p𝑒𝑚𝑝t and pϕt where p𝑒𝑚𝑝tpϕt 33 3 pq means that p is absolutely continuous with respect to q, i.e. p(s)=0q(s)=0. and

Cov𝐒p𝑒𝑚𝑝t[logp𝑒𝑚𝑝t(𝐒),logpϕt(𝐒)]>0, (4)

define the distribution p𝑠𝑘𝑒𝑤𝑒𝑑t as in Equation 3. Let Hα(α) be the entropy of p𝑠𝑘𝑒𝑤𝑒𝑑t for a fixed α. Then there exists a constant a<0 such that for all α[a,0),

(p𝑠𝑘𝑒𝑤𝑒𝑑t)=α(α)>(p𝑒𝑚𝑝t).
Proof.

See Appendix Section A. ∎

Thus, our generative model pϕt does not need to exactly fit the empirical distribution. We merely need for the log densities of pϕt and pempt to be correlated, which we expect to happen frequently with an accurate goal-conditioned policy, since pempt is the set of states seen when trying to reach goals from pϕt. In this case, if we choose negative values of α that are small enough, then the entropy of pskewedt will be higher than that of pempt. Empirically, we found that α values as low as α=-1 performed well.

In summary, we see that under certain assumptions, pskewedt converges to U𝒮. Since we train each generative model pϕt+1 by fitting it to pskewedt, we expect pϕt to also converge to U𝒮.

5 Skew-Fit with Goal-Conditioned Reinforcement Learning

To train agents that can autonomously explore the environment in a controllable manner, we need to maximize the mutual information between goals and states, I(𝐒;𝐆)=(𝐆)-(𝐆𝐒). Thus far, we have presented and derived Skew-Fit assuming that we have access to an oracle goal-reaching policy, allowing us to separately analyze how we can maximize (𝐆). However, in practice we do not have access to such an oracle, and in this section we discuss how one can combine Skew-Fit with existing goal-conditioned reinforcement learning to maximize both of these terms jointly.

Maximizing I(𝐒;𝐆) can be done by simultaneously performing Skew-Fit and training a goal conditioned policy to minimize (𝐆𝐒), or, equivalently, maximize -(𝐆𝐒). Maximizing -(𝐆𝐒) requires computing the density logp(𝐆𝐒), which may be difficult to compute without strong modeling assumptions. However, for any distribution q, the following lower bound for -(𝐆𝐒) holds:

-(𝐆𝐒) =𝔼(𝐆,𝐒)pϕt,π[logq(𝐆𝐒)]+DKL(pq)𝔼(𝐆,𝐒)pϕt,π[logq(𝐆𝐒)],

where DKL denotes Kullback–Leibler divergence as discussed by Barber & Agakov (2004). Thus, to minimize (𝐆𝐒), we train a policy to maximize the following reward:

r(𝐒,𝐆)=logq(𝐆𝐒).

Since Skew-Fit uses a generative model to propose goals, it is particularly natural to combine with reinforcement learning with imagined goals (RIG) Nair et al. (2018), though in principle any goal-conditioned method could be used. RIG is an efficient off-policy goal-conditioned method that solves the vision-based RL problem in a learned latent space. In particular, RIG fits a VAE and uses it to encode all observations and goals into a latent space. RIG also uses the generative model for both goal sampling and compute rewards, logq(𝐆𝐒). Applying Skew-Fit to RIG then amounts to using Skew-Fit rather than MLE to train the VAE. See Appendix E for details on how to do so.

6 Experiments

Our experimental evaluation of Skew-Fit aims to study the following empirical questions: (1) Can Skew-Fit learn a generative model that can model the uniform distribution over the set of valid states? (2) When combined with goal-conditioned RL, can Skew-Fit enable agents to autonomously set and learn to reach a diverse set of goals?

Figure 3: We evaluate on these continuous control environments. From left to right: Visual Pusher, a simulated pushing task; Visual Door, a door opening task; Visual Pickup, a picking task; and Real World Visual Door, a real world door opening task. All tasks are solved from images and without any task-specific reward. See Appendix C for details.
Figure 4: (Left) Learning curves for simulated continuous control experiments. Lower is better. For each environment and method, we show the mean and standard deviation of 6 seeds and smooth temporally across 25 epochs within each seed. Skew-Fit consistently outperforms RIG and various baselines. See the text for description of each method. (Right) The first column displays example test goal images for each environment. In the next two columns, we display final images reached by Skew-Fit and RIG respectively. Under each image is the final distance in state space to provide a notion of the behavior of each method in the plots.

To answer these questions, we evaluate Skew-Fit with RIG on vision-based continuous control. In this setting, the agent must control a robot arm using only image observations, without access to any ground truth reward signal. We test our method on three different simulated continuous control tasks: Visual Door, Visual Pusher, and Visual Pickup. See Figure 3 for visuals and Section E for details of these environments. Training policies for these tasks is done in a completely unsupervised manner without access to either any prior information about the state-space or an oracle goal-sampling distribution. However, to evaluate their performance, we sample goal images from a uniform distribution over valid images and report the agent’s final distance to the corresponding simulator states (e.g. distance of the puck to the target puck location). While this evaluation method and metric is only practical in simulation, it provides us with a quantitative metric for measuring an RL method’s ability to reach a broad coverage of goals in a vision-based setting.

To train policies in image-based environments, for each method we compare to, with the exception of Warde-Farley et al. (2018), we apply it alongside RIG in order to enable them to solve visual tasks and ensure a fair comparison across methods. Additionally, we replace TD3(Fujimoto et al., 2018) with soft actor critic (SAC) Haarnoja et al. (2018) for all the methods that use RIG including Skew-Fit. We found that SAC performed as well or better than TD3 in every environment. In RIG, the VAE was pre-trained on a uniform sampling of images from the state space of each environment. In order to ensure a fair comparison to Skew-Fit, we forego pre-training and instead train the VAE alongside RL, using the variant described in the RIG paper.

First, we compare to standard RIG, without Skew-Fit. We ran RIG using the relabeling scheme described in the hindsight experience replay (HER) paper to form the baseline, HER. Florensa et al. (2018b) samples goals from a GAN based on the difficulty of reaching the goal. We compare against this method by replacing pϕ with the GAN and label it AutoGoal GAN. We compare to Warde-Farley et al. (2018), which uses a non-parametric approach based on clustering to sample goals and a state discriminator to compute rewards. We denote this baseline as DISCERN. We also compare to the goal proposal mechanism proposed by Warde-Farley et al. (2018) without the discriminative reward in DISCERN, which we label DISCERN-g. Lastly, we compare our method to two exploration methods based on reward bonuses: ICM (Pathak et al., 2017), and # Exploration (Tang et al., 2017), denoted ICM and #Exploration respectively.

We see in Figure 4 that Skew-Fit significantly outperforms prior methods both in terms of task performance and sample complexity. The most common failure mode for prior methods is that the goal sampling distribution collapses, resulting in the agent learning to reach only a fraction of the state space. For comparison, more goals sampled from the generative model trained with Skew-Fit and RIG are shown in Figure 13 in the Appendix. We can see that standard RIG produces a small non-diverse distribution for each environment: the object is in the same place for pickup, the puck is often in the starting position for pushing, and the door is always closed.

In contrast, Skew-Fit proposes goals where the object is in the air and on the ground, varied puck positions, and both open and closed door angles. The direct effect of these goal choices can be seen in Figure 12 in the appendix, which shows example reached goals for RIG and Skew-Fit. Standard RIG only learns to reach states close to the initial position, while Skew-Fit learns to reach the entire state space. A quantitative comparison between the two methods on the pickup task can be seen in Figure 10 in the appendix, which gives the cumulative total exploration pickups for each method. From the graph, we can see that only Skew-Fit learns to pay attention to the object, while other methods ignore it completely.

Figure 5: Learning curve for Real World Visual Door environment. We visually label a success if the policy opens the door to the target angle by the last state of the trajectory. Skew-Fit results in considerable sample efficiency gains over prior work on this real-world task.

Real-world vision-based robotic manipulation

Finally, we demonstrate that Skew-Fit scales well to the real world with a door opening task, Real World Visual Door. See Figure 3 for a picture of this environment. While a number of prior works have studied RL-based learning of door opening Kalakrishnan et al. (2011); Chebotar et al. (2017), we demonstrate the first method for autonomous learning of door opening without a user-provided, task-specific reward function. As in simulation, we do not provide any goals to the agent and simply let it interact with the door to solve the door opening task from scratch, without any human guidance or reward signal. We train agents using Skew-Fit as well as RIG. Unlike in simulation, we cannot measure the difference between the policy’s achieved and desired door angle since we do not have access to the true state of the world. Instead, we simply visually denote a binary success/failure for each goal based on whether the last state in the trajectory achieves the target angle. Every seven and a half minutes of interaction time we evaluate on 5 goals and plot the cumulative successes for each method. As Figure 5 shows, standard RIG only starts to open the door consistently after five hours of training. In contrast, Skew-Fit learns to open the door after three hours of training and achieves a perfect success rate after five and a half hours of interaction time, demonstrating that Skew-Fit is a promising technique for solving real world tasks without any human-provided reward function. Videos of our method solving this task, along with the simulated environments, can be viewed on our website.44 4 https://sites.google.com/view/skew-fit

Additional Experiments

We perform additional experiments including a more thorough analysis in a simple 2D setting, a study of how Skew-Fit can be used to train generative models on imbalanced data-sets, and a sensitivity analysis over the parameter α, in Appendix D. Additionally, Appendix E provides a complete description of all our environments as well as our method hyper-parameters.

7 Conclusion

We presented Skew-Fit, an algorithm for training a generative model to closely approximate a uniform distribution over valid states, using data obtained via goal-conditioned reinforcement learning. Our method iteratively re-weights the samples for training the generative model, such that its entropy increases over the set of possible states, and our theoretical analysis gives conditions under which Skew-Fit converges to the uniform distribution. When such a model is used to choose goals for exploration and to relabeling goals for training, the resulting method results in much better coverage of the state space, enabling our method to explore effectively. Our experiments show that it produces quantifiable improvements when used along with goal-conditioned reinforcement learning on simulated robotic manipulation tasks, and can be used to learn a complex door opening skill to reach a 100% success rate directly on a real robot, without any human-provided reward supervision.

8 Acknowledgement

This research was supported by Berkeley DeepDrive, Huawei, ARL DCIST CRA W911NF-17-2-0181, NSF IIS-1651843, and the Office of Naval Research, as well as Amazon, Google, and NVIDIA. We thank Aviral Kumar, Carlos Florensa, Aurick Zhou, Nilesh Tripuraneni, Vickie Ye, Dibya Ghosh, Coline Devin, and John D. Co-Reyes for their insightful discussions and feedback.

References

  • Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., Mcgrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Baranes & Oudeyer (2012) Baranes, A. and Oudeyer, P.-Y. Active Learning of Inverse Models with Intrinsically Motivated Goal Exploration in Robots. Robotics and Autonomous Systems, 61(1):49–73, 2012. doi: 10.1016/j.robot.2012.05.008.
  • Barber & Agakov (2004) Barber, D. and Agakov, F. V. Information maximization in noisy channels: A variational approach. In Advances in Neural Information Processing Systems, pp. 201–208, 2004.
  • Bellemare et al. (2016) Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), pp. 1471–1479, 2016.
  • Billingsley (2013) Billingsley, P. Convergence of probability measures. John Wiley & Sons, 2013.
  • Burda et al. (2018) Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • Burda et al. (2019) Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-scale study of curiosity-driven learning. In International Conference on Learning Representations (ICLR), 2019.
  • Chebotar et al. (2017) Chebotar, Y., Kalakrishnan, M., Yahya, A., Li, A., Schaal, S., and Levine, S. Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3381–3388. IEEE, 2017.
  • Chentanez et al. (2005) Chentanez, N., Barto, A. G., and Singh, S. P. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 1281–1288, 2005.
  • Colas et al. (2018a) Colas, C., Fournier, P., Sigaud, O., and Oudeyer, P. CURIOUS: intrinsically motivated multi-task, multi-goal reinforcement learning. CoRR, abs/1810.06284, 2018a.
  • Colas et al. (2018b) Colas, C., Sigaud, O., and Oudeyer, P.-Y. Gep-pg: Decoupling exploration and exploitation in deep reinforcement learning algorithms. International Conference on Machine Learning (ICML), 2018b.
  • Eysenbach et al. (2019) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is All You Need: Learning Skills without a Reward Function. In International Conference on Learning Representations (ICLR), 2019.
  • Florensa et al. (2017) Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.
  • Florensa et al. (2018a) Florensa, C., Degrave, J., Heess, N., Springenberg, J. T., and Riedmiller, M. Self-supervised Learning of Image Embedding for Continuous Control. In Workshop on Inference to Control at NeurIPS, 2018a.
  • Florensa et al. (2018b) Florensa, C., Held, D., Geng, X., and Abbeel, P. Automatic Goal Generation for Reinforcement Learning Agents. In International Conference on Machine Learning (ICML), 2018b.
  • Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In International Conference on Machine Learning (ICML), 2018.
  • Gupta et al. (2018a) Gupta, A., Eysenbach, B., Finn, C., and Levine, S. Unsupervised meta-learning for reinforcement learning. CoRR, abs:1806.04640, 2018a.
  • Gupta et al. (2018b) Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-Reinforcement Learning of Structured Exploration Strategies. In Advances in Neural Information Processing Systems (NIPS), 2018b.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018.
  • Hausman et al. (2018) Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an Embedding Space for Transferable Robot Skills. In International Conference on Learning Representations (ICLR), pp. 1–16, 2018.
  • Hazan et al. (2018) Hazan, E., Kakade, S. M., Singh, K., and Soest, A. V. Provably efficient maximum entropy exploration. CoRR, abs/1812.02690, 2018.
  • Kaelbling (1993) Kaelbling, L. P. Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), volume vol.2, pp. 1094 – 8, 1993.
  • Kalakrishnan et al. (2011) Kalakrishnan, M., Righetti, L., Pastor, P., and Schaal, S. Learning force control policies for compliant manipulation. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4639–4644. IEEE, 2011.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
  • Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016. ISBN 0-7803-3213-X. doi: 10.1613/jair.301.
  • Lopes et al. (2012) Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, pp. 206–214, 2012.
  • Mohamed & Rezende (2015) Mohamed, S. and Rezende, D. J. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 2125–2133, 2015.
  • Nachum et al. (2018) Nachum, O., Brain, G., Gu, S., Lee, H., and Levine, S. Data-Efficient Hierarchical Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Nair et al. (2018) Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual Reinforcement Learning with Imagined Goals. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Nielsen & Nock (2010) Nielsen, F. and Nock, R. Entropies and cross-entropies of exponential families. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 3621–3624. IEEE, 2010.
  • Ostrovski et al. (2017) Ostrovski, G., Bellemare, M. G., Oord, A., and Munos, R. Count-based exploration with neural density models. In International Conference on Machine Learning, pp. 2721–2730, 2017.
  • Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-Driven Exploration by Self-Supervised Prediction. In International Conference on Machine Learning (ICML), pp. 488–489. IEEE, 2017.
  • Péré et al. (2018) Péré, A., Forestier, S., Sigaud, O., and Oudeyer, P.-Y. Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration. In International Conference on Learning Representations (ICLR), 2018.
  • Pong et al. (2018) Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal Difference Models: Model-Free Deep RL For Model-Based Control. In International Conference on Learning Representations (ICLR), 2018.
  • Savinov et al. (2018) Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.
  • Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal Value Function Approximators. In International Conference on Machine Learning (ICML), pp. 1312–1320, 2015. ISBN 9781510810587.
  • Stadie et al. (2016) Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. In International Conference on Learning Representations (ICLR), 2016.
  • Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • Tang et al. (2017) Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning. In Neural Information Processing Systems (NIPS), 2017.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012. ISBN 9781467317375. doi: 10.1109/IROS.2012.6386109.
  • Veeriah et al. (2018) Veeriah, V., Oh, J., and Singh, S. Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605, 2018.
  • Warde-Farley et al. (2018) Warde-Farley, D., de Wiele, T. V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. CoRR, abs/1811.11359, 2018.
  • Zhao & Tresp (2019) Zhao, R. and Tresp, V. Curiosity-driven experience prioritization via density estimation. CoRR, abs/1902.08039, 2019.

Appendix A Goal-Conditioned RL minimizes (𝐆𝐒)

Some goal-conditioned RL methods such as Warde-Farley et al. (2018); Nair et al. (2018) present methods for minimizing a lower bound for (𝐆𝐒), by approximating logp(𝐆𝐒) and using it as the reward. While, other goal-conditioned RL methods  (Kaelbling, 1993; Lillicrap et al., 2016; Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018; Florensa et al., 2018a) are not developed with the intention of minimizing the conditional entropy (𝐆𝐒), the optimal goal-conditioned policy will deterministically reach the goal, resulting in a conditional entropy of zero: (𝐆𝐒)=0. Thus, goal-conditioned RL methods effectively minimize (𝐆𝐒).

Appendix B Analysis

We note several intermediate results that are necessary in the proof of Skew-Fit.

B.1 Proofs of Lemmas for the Analysis of Skew-Fit

Let qp means that q is absolutely continuous with respect to p, i.e. p(x)=0q(x)=0.

Assumption B.1.

Assume that

  • 𝒮 is compact.

  • The maps from p𝑠𝑘𝑒𝑤𝑒𝑑t to pϕt+1 and pϕt to p𝑒𝑚𝑝t are continuous and do not decrease entropy.

  • The support of pϕt contains 𝒮.

The assumption that 𝒮 is compact is easily achieved in most application and makes U𝒮 well defined.

The continuity assumption states that the method used to fit pϕt+1 and the goal-conditioned policy are well-behaved.

Note that we do not assume that these procedures for obtain pϕt+1 and pempt+1 increase the entropy, and only require that they do not decrease the entropy. Making a statement without these entropy assumption would be difficult: if the goal-conditioned policy ignored the goal and always entered the same state, or if the generative model only captured a single mode pskewedt, it would be challenging for any procedure to result in the uniform distribution.

Next, the assumption that the support of pϕt contains 𝒮 ensures that pskewedt is well defined and a continuous function of pempt and pϕt. Note that pϕt can have support larger than 𝒮, and so we can choose to optimize pϕt over any class of generative models with wide supports, without needing to know the manifold 𝒮.

Under these assumptions, to use Lemma 4.1 for Skew-Fit, we must show (pskewedt)(pempt) with equality if and only if pempt=U𝒮.

Lemma B.2.

Let S be a compact set. Define the set of distributions Q={p: support of p is S}. Let F:QQ be a continuous function and such that H(F(p))H(p) with equality if and only if p is the uniform probability distribution on S, US. Define the sequence of distributions P=(p1,p2,) by starting with any p1Q and recursively defining pt+1=F(pt).

The sequence P converges to US.

Proof.

The uniform distribution U𝒮 is well defined since 𝒮 is compact. Because 𝒮 is a compact set, by Prokhorov’s Theorem Billingsley (2013), the set 𝒬 is sequentially compact. Thus, P has a convergent subsequence P=(pk1,pk2,)P for k1<k2< that converges to a distribution p*𝒬. Because is continuous, p* must be a fixed point of since by the convergence mapping theorem, we have that

limipki=p* limi(pki)=(p*)

and so

p* =limipki
=limi(pki-1)
=(p*).

The only fixed point of is U𝒮 since for any distribution p that is not the uniform distribution, U𝒮, we have that ((p))>(p) which implies that (p)p. Thus, P converges to the only fixed point, U𝒮. Since the entropy cannot decrease, then entropy of the distributions in P must also converge to the entropy of U𝒮. Lastly, since entropy is a continuous function of distribution, P must converge to U𝒮. ∎

Lemma B.3.

Assume the set S has finite volume so that its uniform distribution US is well defined and has finite entropy. Given any distribution p(s) whose support is S, recursively define pα

pα(𝐬) =1Zαp(𝐬)α,𝐬𝒮

where Zα is the normalizing constant and α[0,1)55 5 In the paper, α[-1,0). However, when pempt=pϕt, Equation 3 becomes pskewedt(S) =1Zαpϕt(S)pϕt(S)α,α[-1,0) =1Zαpϕt(S)α,α[0,1) .

For all α[0,1),

(pα)(p)

with equality if and only if p is US, the uniform distribution S.

Proof.

If α=0 or p is the uniform distribution, the result is clearly true. We now study the case where α(0,1) and pU𝒜.

Define the one-dimensional exponential family {pθt:α[0,1]} where pθt is

pθt(𝐬)=eαT(𝐬)-A(α)+k(𝐬)

with log carrier density k(𝐬)=0, natural parameter α, sufficient statistic T(𝐬)=logpt(𝐬), and log-normalizer A(α)=𝒮eαT(𝐬)𝑑𝐬. As shown in Nielsen & Nock (2010), the entropy of a distribution from a one-dimensional exponential family with parameter α is given by:

θt(α)(pθt)=A(α)-αA(α)

The derivative with respect to α is then

ddαdθt(α) =-αA′′(α)
=-αVar𝐬pθt[T(𝐬)]
=-αVar𝐬pθt[logpt(𝐬)]
0

where we use the fact that the nth derivative of A(α) is the n central moment, i.e. A′′(α)=Var𝐬pθt[T(𝐬)]. Since variance is always non-negative, this means the entropy is monotonically decreasing with α, and so

(pα)(p1)=(p)

with equality if and only if

Var𝐬pθt[logp(𝐬)]=0.

However, this only happens if logp(𝐬) is constant over its support, i.e. it is the uniform distribution over its support. ∎

We also prove the convergence directly for the (even more) simplified case when pskewedt=pϕt+1=pempt+1 using a similar technique:

Lemma B.4.

Assume the set S has finite volume so that its uniform distribution US is well defined and has finite entropy. Given any distribution p(s) whose support is S, recursively define pt with p1=p and

pt+1(𝐬) =1Zαtpt(𝐬)α,𝐬𝒮

where Zαt is the normalizing constant and α[0,1).

The sequence (p1,p2,) converges to US, the uniform distribution S.

Proof.

If α=0, then p2 (and all subsequent distributions) will clearly be the uniform distribution. We now study the case where α(0,1).

At each iteration t, define the one-dimensional exponential family {pθt:θ[0,1]} where pθt is

pθt(𝐬)=eθT(𝐬)-A(θ)+k(𝐬)

with log carrier density k(𝐬)=0, natural parameter θ, sufficient statistic T(𝐬)=logpt(𝐬), and log-normalizer A(θ)=𝒮eθT(𝐬)𝑑𝐬. As shown in Nielsen & Nock (2010), the entropy of a distribution from a one-dimensional exponential family with parameter θ is given by:

θt(θ)(pθt)=A(θ)-θA(θ)

The derivative with respect to θ is then

ddθdθt(θ) =-θA′′(θ)
=-θVar𝐬pθt[T(𝐬)]
=-θVar𝐬pθt[logpt(𝐬)] (5)
0

where we use the fact that the nth derivative of A(θ) is the n central moment, i.e. A′′(θ)=Var𝐬pθt[T(𝐬)]. Since variance is always non-negative, this means the entropy is monotonically decreasing with θ. Note that pt+1 is a member of this exponential family, with parameter θ=α(0,1). So

(pt+1)=θt(α)θt(1)=(pt)

which implies

(p1)(p2).

This monotonically increasing sequence is upper bounded by the entropy of the uniform distribution, and so this sequence must converge.

The sequence can only converge if ddθθt(θ) converges to zero. However, because α is bounded away from 0, Equation 5 states that this can only happen if

Var𝐬pθt[logpt(𝐬)]0. (6)

Because pt has full support, then so does pθt. Thus, Equation 6 is only true if logpt(𝐬) converges to a constant, i.e. pt converges to the uniform distribution. ∎

Lemma B.5.

Given two distribution p(x) and q(x) where pq and

0<Covp[logp(X),logq(X)] (7)

define the distribution pα as

pα(x) =1Zαp(x)q(x)α

where αR and Zα is the normalizing factor. Let Hα(α) be the entropy of pα. Then there exists a constant a>0 such that for all α[-a,0),

α(α)>α(0)=(p). (8)
Proof.

Observe that {pα:α[-1,0]} is a one-dimensional exponential family

pα(x)=eαT(x)-A(α)+k(x)

with log carrier density k(x)=logp(x), natural parameter α, sufficient statistic T(x)=logq(x), and log-normalizer A(α)=𝒳eαT(x)+k(x)𝑑x. As shown in Nielsen & Nock (2010), the entropy of a distribution from a one-dimensional exponential family with parameter α is given by:

α(α)(pα)=A(α)-αA(α)-𝔼pα[k(X)]

The derivative with respect to α is then

ddαα(α) =-αA′′(α)-ddα𝔼pα[k(x)]
=-αA′′(α)-𝔼α[k(x)(T(x)-A(α)]
=-αVarpα[T(x)]-Covpα[k(x),T(x)]

where we use the fact that the nth derivative of A(α) give the n central moment, i.e. A(α)=𝔼pα[T(x)] and A′′(α)=Varpα[T(x)]. The derivative of α=0 is

ddαα(0) =-Covp0[k(x),T(x)]
=-Covp[logp(x),logq(x)]

which is negative by assumption. Because the derivative at α=0 is negative, then there exists a constant a>0 such that for all α[-a,0], α(α)>α(0)=(p). ∎

Appendix C Environment Details

Point-Mass: In this environment, an agent must learn to navigate a square-shaped corridor (see Figure 6). The observation is the 2D position, and the agent must specify a velocity as the 2D action. The reward at each time step is the negative distance between the achieved position and desired position.

Visual Pusher: A MuJoCo environment with a 7-DoF Sawyer arm and a small puck on a table that the arm must push to a target position. The agent controls the arm by commanding x,y position for the end effector (EE). The underlying state is the EE position, e and puck position p. The evaluation metric is the distance between the goal and final puck positions. The hand goal/state space is a 10x10 cm2 box and the puck goal/state space is a 30x20 cm2 box. Both the hand and puck spaces are centered around the origin. The action space ranges in the interval [-1,1] in the x and y dimensions.

Visual Door: A MuJoCo environment with a 7-DoF Sawyer arm and a door on a table that the arm must pull open to a target angle. Control is the same as in Visual Pusher. The evaluation metric is the distance between the goal and final door angle, measured in radians. In this environment, we do not reset the position of the hand or door at the end of each trajectory. The state/goal space is a 5x20x15 cm3 box in the x,y,z dimension respectively for the arm and an angle between [0,.83] radians. The action space ranges in the interval [-1,1] in the x, y and z dimensions.

Visual Pickup: A MuJoCo environment with the same robot as Visual Pusher, but now with a different object. The object is cube-shaped, but a larger intangible sphere is overlaid on top so that it is easier for the agent to see. Moreover, the robot is constrained to move in 2 dimension: it only controls the y,z arm positions. The x position of both the arm and the object is fixed. The evaluation metric is the distance between the goal and final object position. For the purpose of evaluation, 75% of the goals have the object in the air and 25% have the object on the ground. The state/goal space for both the object and the arm is 10cm in the y dimension and 13cm in the z dimension. The action space ranges in the interval [-1,1] in the y and z dimensions.

Real World Visual Door: A Rethink Sawyer Robot with a door on a table. The arm must pull the door open to a target angle. The agent controls the arm by commanding the x,y,z velocity of the EE. Our controller commands actions at a rate of up to 10Hz with the scale of actions ranging up to 1cm in magnitude. The underlying state and goal is the same as in Visual Door. Again we do not reset the position of the hand or door at the end of each trajectory. We obtain images using a Kinect Sensor and our robotic control code can be found at https://github.com/mdalal2020/sawyer_control.git The state/goal space for the environment is a 10x10x10 cm3 box. The action space ranges in the interval [-1,1] (in cm) in the x, y and z dimensions. The door angle lies in the range [0,45] degrees.

Appendix D Additional Experiments

Section D.1 studies the analyzes how well Skew-Fit can learn a maximum entropy goal distribution in isolation of a perfect goal reacher. We study this question in the context of a simple 2-dimensional navigation environment. Section D.3 analyzes the performance of Skew-Fit when combined with RIG on a variety of simulated and real vision-based robot manipulation tasks.

Figure 6: (Left) The set of final states visited by our agent and MLE over the course of training. In contrast to MLE, our method quickly approaches a uniform distribution over the set of valid states. (Right) The entropy of the sample data distribution, which quickly reaches its maximum for Skew-Fit. The entropy was calculated via discretization onto a 60 by 60 grid.

D.1 Skew-Fit on Simplified RL Environment

We analyze the effect of Skew-Fit for learning a goal distribution in isolation of learning a goal reacher by studying an idealized example where the policy is a near-perfect goal-reaching policy.

The MDP is defined on a 2-by-2 unit square-shaped corridor (see Figure 6). At the beginning of an episode, the agent begins in the bottom-left corner and samples a goal from a goal distribution pϕt. The policy reaches the state that is closest to this goal and inside the corridor, giving us a state 𝐒 to add to our empirical distribution. We compare Skew-Fit to a goal sampling distribution that is only trained using maximum likelihood estimation (MLE).

As seen in Figure 6, Skew-Fit results in learning a high entropy, near-uniform distribution over the state space. In contrast, MLE only models the states that are explored by the initial noise of the policy, resulting in the policy only setting goals in and exploring the bottom left corner. These results empirically validate that naively using previous experience to set goals will not result in diverse exploration and that Skew-Fit results in a maximum-entropy goal-sampling distribution.

D.2 Modeling Visual Observations

We would like to use Skew-Fit to learn maximum-entropy distributions over complex, high-dimensional state spaces, where we cannot manually design these uniform distributions. The next set of experiments study how Skew-Fit can be used to train a generative model to sample diverse images when trained on an imbalanced dataset. For these experiments, we use a simulated MuJoCo (Todorov et al., 2012) environment and real environment that each consists of a 7 degree of freedom robot arm in front of a door that it can open. See Figure 3 for a visualization of the simulated and real-world door environment, and the Appendix for more details on the environment.

We generate a dataset of images from the environment by running a policy that samples actions uniformly at random. Such a policy represents a particularly challenging setting for standard VAE training methods: a policy that chooses random actions does not visit uniformly random states, but rather states that are heavily biased toward ones that resemble the initial state. In the door opening environment, this means that many of the samples have the door in the initial closed position, and only the robot’s arm moves. We then train two generative models on these datasets: one using Skew-Fit and another using MLE. For our generative model, we use the same generative model as the one in RIG, a variational autoencoder Kingma & Welling (2014). To estimate the likelihood of our data, we use Monte Carlo estimation and importance sampling to marginalize the latent variables. See Appendix Section E.2 for experimental details.

In this experiment, we do not train a goal-conditioned policy, and instead only study how Skew-Fit can be used to effectively “balance” this dataset. As can be seen in Figure 7 and Figure 1, the samples produced by the model trained with Skew-Fit generally have a much wider range of door angles, while the model trained with MLE only captures a single mode of the door, both for simulated and real-world images.

D.3 Skew-Fit with Learned Policies

2D navigation

We reproduce the 2D navigation environment experiment from Section D.1, and replace the oracle goal-reacher with a goal-reaching policy that is simultaneously trained with the goal setter. The policy outputs velocities with maximum speed of one. Evaluation goals are chosen uniformly over the valid states. In Figure 7(a), we can see that a policy trained with a goal distribution trained by Skew-Fit consistently learns to reach all goals, whereas a goal distribution trained with MLE results in a policy that fails to reach states far from the starting position.

Sensitivity Analysis

We study the sensitvity of the α hyperparameter by testing values of α[-1,-0.75,-0.5,-0.25,0] on the Visual Door and Visual Pusher task. The results are included in Figure 11 and shows that our method is relatively robust to different parameters of α, particularly for the more challenging Visual Pusher task.

Prioritization Ablation

We test Skew-Fit against a concurrent method  Zhao & Tresp (2019) that suggests a different strategy in which pskewedt(𝐬)=1Zαpempt(𝐬)pϕt(𝐬)α,α[-1,0) is replaced with pskewedt(𝐬)=1Zpempt(𝐬)rank(1-pϕt(𝐬)). The results are shown in Figure 9. Besides the Visual Door Opening Environment, which is the easiest of three simulated tasks, our method performs significantly better than rank-based weighting, illustrating the effectiveness of Skew-Fit.

(a) Skew-Fit
(b) MLE
Figure 7: Samples from a generative model pϕ when trained with (a) Skew-Fit and with (b) maximum likelihood estimation trained on images from the simulated door opening task. The models are trained on a dataset collected by executing a random policy in the environment, which results in mostly images with a closed door and only occasional images with the door open. Note that the Skew-Fit samples are substantially more diverse, meaning that if pϕ were used to sample goals, it would encourage the agent to practice opening the door more frequently.
(a)
(b)
Figure 8: (a) Comparison of Skew-Fit vs MLE goal sampling on final distance to goal on RL version of the pointmass environment. Skew-Fit consistently learns to solve the task, while MLE often fails. (b) Heatmaps of final distance to each possible goal location for Skew-Fit and MLE. Skew-Fit learns a good policy over the entire state space, but MLE performs poorly for states far away from the starting position.
(a)
(b)
(c)
Figure 9: Comparison of Skew-Fit and Rank Based Prioritization as described in  Zhao & Tresp (2019). We demonstrate that with the exclusion of the Visual Door Environment our method significantly outperforms this concurrent work.
Figure 10: Cumulative total pickups during exploration for each method. RIG + Skew-Fit quickly learns to learn to pick up the object while the baselines fail to pay attention to the object.

Appendix E Implementation Details

E.1 2D Navigation Experiments

We initialize the VAE to only output points in the bottom left corner of the environment. Both the encoder and decoder have ReLU hidden activations, 2 hidden layers with 32 units, and no output activations. The VAE has a latent dimension of 16 and a Gaussian decoder trained with mean-squared error loss, batch size of 500, and 100 epochs per iteration. For Skew-Fit hyperparameters, α=-0.5 and N=10000.

For the RL version of this task, the VAE was trained in the same way. The RL hyperparameters are listed in Table 1.

(a)
(b)
Figure 11: We sweep different values of α on (a) Visual Door and (b) Visual Pusher. Skew-Fit helps the finally performance marginally on the Visual Door task, and even degrades performance if α is large in magnitude. In the more challenging Visual Pusher task, we see that Skew-Fit consistently helps and halves the final distance.

E.2 Vision-Based Continuous Control Experiments

For our underlying RL algorithm, we use a modified version of soft actor critic (SAC) with automated entropy tuning  Haarnoja et al. (2018) and twin Q-functions Fujimoto et al. (2018). This is in contrast to the original RIG Nair et al. (2018) paper which used TD3 Fujimoto et al. (2018). We found that maximum entropy policies in general improved the performance of RIG, and that we did not need to add noise on top of the stochastic policy’s noise. For our RL network architectures and training scheme, we use fully connected networks for the policy, Q-function and value networks with two hidden layers of size 400 and 300 each. We also delay training any of these networks for 10000 time steps in order to collect sufficient data for the replay buffer as well as to ensure the latent space of the VAE is relatively stable (since we train the VAE online in this setting). As in RIG, we train a goal-conditioned value functions Schaul et al. (2015) using hindsight experience replay Andrychowicz et al. (2017), relabelling 50% of exploration goals as goals sampled from the VAE prior 𝒩(0,1) and 30% from future goals in the trajectory.

In our experiments, we use an image size of 48x48. For our VAE architecture, we use a modified version of the architecture used in the original RIG paper  Nair et al. (2018). Our VAE has three convolutional layers with kernel sizes: 5x5, 3x3, and 3x3, number of output filters: 16, 32, and 64 and strides: 3, 2, and 2. We then have a fully connected layer with the latent dimension number of units, and then reverse the architecture with de-convolution layers. We vary the latent dimension of the VAE, the β term of the VAE and the α term for Skew-Fit based on the environment. Additionally, we vary the training schedule of the VAE based on the environment. See the table at the end of the appendix for more details. Our VAE has a Gaussian decoder with identity variance, meaning that we train the decoder with a mean-squared error loss.

We estimate the density under the VAE by using a sample-wise approximation to the marginal over x estimated using importance sampling:

pϕt(x) =𝔼zqθt(z|x)[p(z)qθt(z|x)pψt(xz)]
1Ni=1N[p(z)qθt(z|x)pψt(xz)].

where qθ is the encoder, pψ is the decoder, and p(z) is the prior, which in this case is unit Gaussian. In practice we found that sampling N=10 latents for estimating the density to work well in practice.

When training the VAE alongside RL, we found the following two schedules to be effective for different environments:

  1. 1.

    For first 5K steps: Train VAE using standard MLE training every 500 time steps for 1000 batches. After that, train VAE using Skew-Fit every 500 time steps for 200 batches.

  2. 2.

    For first 5K steps: Train VAE using standard MLE training every 500 time steps for 1000 batches. For the next 45K steps, train VAE using Skew-Fit every 500 steps for 200 batches. After that, train VAE using Skew-Fit every 1000 time steps for 200 batches.

We found that initially training the VAE without Skew-Fit improved the stability of the algorithm. This is due to the fact that density estimates under the VAE are constantly changing and inaccurate during the early phases of training. Therefore, it made little sense to use those estimates to prioritize goals early on in training. Instead, we simply train using MLE training for the first 5K timesteps, and after that we perform Skew-Fit according to the VAE schedules above. Table 2 lists the hyper-parameters that were shared across the continuous control experiments. Table 3 lists hyper-parameters specific to each environment. Additionally, subsection E.2 shows the combined RIG + Skew-Fit algorithm.

Figure 12: Example reached goals by Skew-Fit and RIG. The first column of each environment section specifies the target goal while the second and third columns show reached goals by Skew-Fit and RIG. Both methods learn how to reach goals close to the initial position, but only Skew-Fit learns to reach the more difficult goals.
Figure 13: Proposed goals from the VAE for RIG and with Skew-Fit on the Visual Pickup, Visual Pusher, and Visual Door environments. Standard RIG produces goals where the door is closed and the object and puck is in the same position, while RIG + Skew-Fit proposes goals with varied puck positions, occasional object goals in the air, and both open and closed door angles.
Figure 14: Proposed goals from the VAE for RIG (left) and with RIG + Skew-Fit (right) on the Real World Visual Door environment. Standard RIG produces goals where the door is closed while RIG + Skew-Fit proposes goals with both open and closed door angles.
Hyper-parameter Value
Algorithm TD3 Fujimoto et al. (2018)66 6 We expect similar performance had we used SAC.
# training batches per time step 1
Q network hidden sizes 400,300
Policy network hidden sizes 400,300
Q network and policy activation ReLU
Exploration Noise None
RL Batch Size 1024
Discount Factor 0.99
Path length 25
Reward Scaling 100
Number of steps per epoch 5000
Table 1: Hyper-parameters used for 2D RL experiment (Figure 7(a)).
Hyper-parameter Value Comments
# training batches per time step 2 Marginal improvements after 2
Exploration Noise None (SAC policy is stochastic) Did not tune
RL Batch Size 1024 smaller batch sizes work as well
VAE Batch Size 64 Did not tune
Discount Factor 0.99 Did not tune
Reward Scaling 1 Did not tune
Path length 100 Did not tune
Number of Latents for Estimating Density (N) 10 Marginal improvements beyond 10
Table 2: General hyper-parameters used for all continuous control experiments.
Hyper-parameter Visual Pusher Visual Door Visual Pickup Real World Visual Door
Path Length 50 100 50 100
β for β-VAE 20 20 30 60
Latent Dimension Size 4 16 16 16
α for Skew-Fit -1 -1/2 -1 -1/2
VAE Training Schedule 2 1 2 1
Sample Goals From pϕ pskewed pskewed pskewed
Table 3: Environment specific hyper-parameters
{algorithm}

Skew-Fit with RIG {algorithmic}[1] \REQUIREVAE encoder qϕ, VAE decoder pψ, policy πθ, goal-conditioned value function Qw, α, VAE Training Schedule. \FORn=0,,N-1 episodes \STATESample latent goal zg from pϕ which is the prior p(z) or pskewed. \STATESample initial state s0E. \FORt=0,,H-1 steps \STATEGet action atπθ(e(st),zg). \STATEGet next state st+1p(st,at). \STATEStore (st,at,st+1,zg) into replay buffer . \STATESample transition (s,a,s,zg). \STATEEncode z=e(s),z=e(s). \STATE(Probability 0.5) replace zg with zgp(z). \STATECompute new reward r=-||z-zg||. \STATEMinimize Bellman Error using (z,a,z,zg,r). \ENDFOR\FORt=0,,H-1 steps \FORi=0,,k-1 steps \STATESample future state shi, t<hiH-1. \STATEStore (st,at,st+1,e(shi)) into . \ENDFOR\ENDFOR\STATEConstruct skewed distribution pskewedt using data from using Equation 3 \IFtotal_steps<5000 \STATETrain β-VAE on data sampled uniformly from according to VAE Training Schedule \ELSE\STATETrain β-VAE on data sampled from using pskewed according to VAE Training Schedule \ENDIF\ENDFOR