Abstract
In standard reinforcement learning, each new skill requires amanuallydesigned reward function, which takes considerable manual effort andengineering. Selfsupervised goal setting has the potential to automate thisprocess, enabling an agent to propose its own goals and acquire skills thatachieve these goals. However, such methods typically rely on manuallydesignedgoal distributions, or heuristics to force the agent to explore a wide range ofstates. We propose a formal exploration objective for goalreaching policiesthat maximizes state coverage. We show that this objective is equivalent tomaximizing the entropy of the goal distribution together with goal reachingperformance, where goals correspond to entire states. We present an algorithmcalled SkewFit for learning such a maximumentropy goal distribution, and showthat under certain regularity conditions, our method converges to a uniformdistribution over the set of possible states, even when we do not know this setbeforehand. SkewFit enables selfsupervised agents to autonomously choose andpractice diverse goals. Our experiments show that it can learn a variety ofmanipulation tasks from images, including opening a door with a real robot,entirely from scratch and without any manuallydesigned reward function.
Quick Read (beta)
SkewFit: StateCovering SelfSupervised Reinforcement Learning
Abstract
In standard reinforcement learning, each new skill requires a manuallydesigned reward function, which takes considerable manual effort and engineering. Selfsupervised goal setting has the potential to automate this process, enabling an agent to propose its own goals and acquire skills that achieve these goals. However, such methods typically rely on manuallydesigned goal distributions, or heuristics to force the agent to explore a wide range of states. We propose a formal exploration objective for goalreaching policies that maximizes state coverage. We show that this objective is equivalent to maximizing the entropy of the goal distribution together with goal reaching performance, where goals correspond to entire states. We present an algorithm called SkewFit for learning such a maximumentropy goal distribution, and show that under certain regularity conditions, our method converges to a uniform distribution over the set of valid states, even when we do not know this set beforehand. SkewFit enables selfsupervised agents to autonomously choose and practice reaching diverse goals. Our experiments show that it can learn a variety of manipulation tasks from images, including opening a door with a real robot, entirely from scratch and without any manuallydesigned reward function.
SkewFit: StateCovering SelfSupervised Reinforcement Learning
Vitchyr H. Pong^{†}^{†}thanks: Equal contribution, Murtaza Dalal^{†}^{†}footnotemark: , Steven Lin^{†}^{†}footnotemark: , Ashvin Nair, Shikhar Bahl, Sergey Levine University of California, Berkeley {vitchyr,mdalal,stevenlin598,anair17,shikharbahl,svlevine}@berkeley.edu
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction

Reinforcement learning (RL) provides an appealing formalism for automated learning of behavioral skills. However, if we want agents to learn large repertoires of behaviors, this framework proves inadequate: separately learning every potentially useful skill becomes prohibitively time consuming, both in terms of the experience required for the agent, and the effort required for the user to design reward functions for each behavior. What if we could instead design an unsupervised RL algorithm that automatically explores the environment and iteratively distills this experience into generalpurpose policies that can accomplish new userspecified tasks at test time?
For an agent to learn autonomously, it needs an exploration objective. An effective exploration scheme is one that visits as many states as possible, allowing a policy to autonomously prepare for any possible userspecified task that it might see at test time. We can formalize this objective as maximizing the entropy of the learned policy’s stat distribution $\mathscr{H}(\mathbf{S})$, since a policy that maximizes this objective should approach a uniform distribution over possible legal states. However, a shortcoming of this objective is that the resulting policy cannot be used to solve new tasks: it only knows how to maximize state entropy. Thus, if we want to develop principled unsupervised RL algorithms that result in useful policies, maximizing $\mathscr{H}(\mathbf{S})$ is not enough. We need a mechanism that allows us to control the resulting policy to achieve new goals at testtime.
We argue that this can be accomplished by performing goaldirected exploration. In addition to maximizing the state distribution entropy $\mathscr{H}(\mathbf{S})$, we should be able to control where the policy goes by giving it a goal $\mathbf{G}$ that corresponds to a desired state. Mathematically, this can be accomplished by stating that a goalconditioned policy should also minimize the conditional entropy over the states given a goal, $\mathscr{H}(\mathbf{S}\mid \mathbf{G})$. This objective provides us with a principled way for training a policy to explore all states (“maximize $\mathscr{H}(\mathbf{S})$”) such that the state that is reached can be controlled by commanding goals (“minimize $\mathscr{H}(\mathbf{S}\mid \mathbf{G})$”).
Directly using this objective is often intractable, since it requires optimizing the marginal state distribution of the policy. However, we can avoid this difficult optimization by noting that our objective is the mutual information between the state and the goal, $I(\mathbf{S};\mathbf{G})$, which can be written as:
$\mathscr{H}(\mathbf{S})\mathscr{H}(\mathbf{S}\mathbf{G})=I(\mathbf{S};\mathbf{G})=\mathscr{H}(\mathbf{G})\mathscr{H}(\mathbf{G}\mathbf{S}).$  (1) 
Equation 1 thus gives an equivalent objective for an unsupervised RL algorithm: the agent should set diverse goals for itself (“maximize $\mathscr{H}(\mathbf{G})$”) and learn how to reach them (“minimize $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$”).
While the second term is the typical objective studied in goalconditioned RL (Kaelbling, 1993; Andrychowicz et al., 2017), maximizing the diversity of goals is crucial for effectively learning to reach all possible states. In a new environment, acquiring such a maximumentropy goal distribution is challenging: how can an agent set diverse goals when it does not even know what states exist?
In this paper, we answer this question by presenting SkewFit, a method for learning to model the uniform distribution over states, given only access to data collected by an autonomous goalconditioned policy. Our paper makes the following contributions. First, we propose a principled objective for unsupervised RL given by Equation 1. While a number of prior works ignore the $\mathscr{H}(\mathbf{G})$ term, we argue that jointly optimizing the entire quantity is needed to develop effective and useful exploration. Second, we propose SkewFit, a method that, under regularity conditions, provably maximizes $\mathscr{H}(\mathbf{G})$ even when the underlying state space is unknown. Third, we empirically demonstrate that, when combined with goalconditioned RL, SkewFit allows us to autonomously train goalconditioned policies that reach diverse states. We test this method on a variety of simulated visionbased robot tasks without any taskspecific reward function. In these experiments, SkewFit reaches substantially better final performance than prior methods, and learns much more quickly. We also demonstrate that our approach solves a realworld manipulation task, which requires a robot to learn to open a door, from scratch in about five hours, without any manuallydesigned reward function, directly from images.
2 Related Work
Many prior methods for training goalconditioned policies assume that a goal distribution is available to sample from during exploration (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018). Other methods use data collected from a randomly initialized policy or heuristics based on data collected online to design a nonparametric (Colas et al., 2018b; WardeFarley et al., 2018; Florensa et al., 2018a) or parametric (Péré et al., 2018; Nair et al., 2018) goal distribution. We remark that WardeFarley et al. (2018) also motivate their work in terms of minimizing a lower bound for $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$. Our work is complementary to these goalreaching methods: rather than focusing on how to train goalreaching policies, we propose a principled method for maximizing the entropy of a goal sampling distribution, $\mathscr{H}(\mathbf{G})$.
Our method learns without any task rewards, directly acquiring a policy that can be reused to reach userspecified goals. This stands in contrast to exploration methods that give bonus rewards based on state visitation frequency (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017; Savinov et al., 2018; Chentanez et al., 2005; Lopes et al., 2012; Stadie et al., 2016; Pathak et al., 2017; Burda et al., 2018, 2019; Mohamed & Rezende, 2015; Chentanez et al., 2005). While these methods can also be used without a task reward, they provide no mechanism for distilling the knowledge gained from visiting diverse states into flexible policies that can be applied to accomplish new goals at testtime: their policies visit novel states, and they quickly forget about them as other states become more novel.
Other prior methods extract reusable skills in the form of latentvariableconditioned policies, where latent variables can be interpreted as options (Sutton et al., 1999) or abstract skills (Hausman et al., 2018; Gupta et al., 2018b; Eysenbach et al., 2019; Gupta et al., 2018a; Florensa et al., 2017). The resulting skills may be diverse, but they have no grounded interpretation, while our method can be used immediately after unsupervised training to reach diverse userspecified goals.
Some prior methods propose to choose goals based on heuristics such as learning progress (Baranes & Oudeyer, 2012; Veeriah et al., 2018; Colas et al., 2018a), how offpolicy the goal is (Nachum et al., 2018), level of difficulty (Florensa et al., 2018b) or likelihood ranking (Zhao & Tresp, 2019). In contrast, our approach provides a principled framework for optimizing a concrete and wellmotivated exploration objective, and can be shown to maximize this objective under regularity assumptions. Hazan et al. (2018) also provably optimizes a wellmotivated exploration objective, but is limited to tabular MDPs, while ours is able to handle high dimensional settings such as visionbased continuous control. In order to maximize state entropy, our method employs a reweighting scheme when fitting the goal proposal model. Concurrent work has proposed a related prioritization based on ranking (Zhao & Tresp, 2019) but does not consider the unsupervised imagebased reinforcement learning setting. We compare against this method in Section D.3 of the Appendix and find that our method significantly outperforms it.
3 Problem Formulation
To ensure that an unsupervised reinforcement learning agent learns to reach all possible states in a controllable way, we maximize the mutual information between the state $\mathbf{S}$ and the goal $\mathbf{G}$, $I(\mathbf{S};\mathbf{G})$. As shown in Equation 1, this objective can be decomposed into two separate terms: minimizing $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$ and maximizing $\mathscr{H}(\mathbf{G})$.
3.1 Minimizing $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$: GoalConditioned RL
Standard RL considers a Markov decision process (MDP), which has a state space $\mathcal{S}$, action space $\mathcal{A}$, and unknown dynamics $\rho ({\mathbf{s}}_{t+1}\mid {\mathbf{s}}_{t},{\mathbf{a}}_{t}):\mathcal{S}\times \mathcal{S}\times \mathcal{A}\mapsto [0,+\mathrm{\infty})$. Goalconditioned RL also includes a goal space $\mathcal{G}$, which we assume to be the same as the state space, $\mathcal{G}=\mathcal{S}$. ^{1}^{1} 1 We note that goalconditioned RL can always be formulated as a standard RL problem by appending the goal to the state. Additionally, some authors define the goal as a feature of the state. It is straightforward to apply the analysis and method presented in this paper to this setting. A goalconditioned policy $\pi (\mathbf{a}\mid \mathbf{s},\mathbf{g})$ maps a state $\mathbf{s}\in \mathcal{S}$ and goal $\mathbf{g}\in \mathcal{S}$ to a distribution over actions $\mathbf{a}\in \mathcal{A}$, and its objective is to reach the goal, i.e., to make the current state equal to the goal.
Goalreaching can be formulated as minimizing $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$, and many practical goalreaching algorithms (Kaelbling, 1993; Lillicrap et al., 2016; Schaul et al., 2015; Andrychowicz et al., 2017; Nair et al., 2018; Pong et al., 2018; Florensa et al., 2018a) can be viewed as approximations of this. See Appendix A for more details. Our method may thus be used in conjunction with any of these prior goalconditioned RL methods in order to jointly minimize $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$ and maximize $\mathscr{H}(\mathbf{G})$.
3.2 Maximizing $\mathscr{H}(\mathbf{G})$: Setting Diverse Goals
We now turn to the problem of setting diverse goals or, mathematically, maximizing the entropy of the goal distribution $\mathscr{H}(\mathbf{G})$. Let ${U}_{\mathcal{S}}$ be the uniform distribution over $\mathcal{S}$, where we assume $\mathcal{S}$ has finite volume so that the uniform distribution is welldefined. Let ${p}_{\varphi}$ be the goal distribution, i.e. $\mathbf{G}\sim {p}_{\varphi}$. Our goal is to maximize the entropy of ${p}_{\varphi}$, which we write as $\mathscr{H}(\mathbf{G})$. Since the maximum entropy distribution over $\mathcal{S}$ is the uniform distribution ${U}_{\mathcal{S}}$, maximizing $\mathscr{H}(\mathbf{G})$ may seem as simple as choosing the uniform distribution to be our goal distribution: ${p}_{\varphi}={U}_{\mathcal{S}}$. However, this requires knowing the uniform distribution over valid states, which may be difficult to obtain when $\mathcal{S}$ is a strict subset of ${\mathbb{R}}^{n}$, for some $n$. For example, if the states correspond to images viewed through a robot’s camera, $\mathcal{S}$ corresponds to the (unknown) set of valid images of the robot’s environment, while ${\mathbb{R}}^{n}$ corresponds to all possible arrays of pixel values of a particular size. In such environments, sampling from the uniform distribution ${\mathbb{R}}^{n}$ is unlikely to correspond to a valid image of the real world. Sampling uniformly from $\mathcal{S}$ would require knowing the set of all possible valid images, which we assume the agent does not know when starting to explore the environment.
While we cannot sample arbitrary states from $\mathcal{S}$, we can sample states by performing goaldirected exploration. To derive and analyze our method, we introduce a simple model of this process: a goal $\mathbf{G}\sim {p}_{\varphi}$ is sampled from the goal distribution ${p}_{\varphi}$, and then the agent attempts to achieve this goal, which results in a distribution of states $\mathbf{S}\in \mathcal{S}$ seen along the trajectory. We abstract this entire process by writing the resulting marginal distribution over $\mathbf{S}$ as $p(\mathbf{S}\mid {p}_{\varphi})$. We assume that $p(\mathbf{S}\mid {p}_{\varphi})$ has full support, which can be accomplished with an epsilongreedy goal reaching policy in a communicating MDP. We also assume we have access to an oracle goal reacher, so that $p(\mathbf{S}\mid {p}_{\varphi})\approx {p}_{\varphi}(\mathbf{S})$. This simplified model allows us to analyze the behavior of our goalsetting scheme in isolation of any goalreaching algorithm, but we show in Section 6 that we can jointly learn a goalreaching policy.
In summary, we want to acquire a maximumentropy goal distribution ${p}_{\varphi}$ over valid states $\mathcal{S}$, while only having access to state samples from $p(\mathbf{S}\mid {p}_{\varphi})$.
4 SkewFit: Learning a Maximum Entropy Goal Distribution
Our method, SkewFit, learns a maximum entropy goal distribution ${p}_{\varphi}$ using samples collected from a goalconditioned policy. We analyze the algorithm and show that SkewFit maximizes the entropy of the goal distribution, and present a practical instantiation for unsupervised deep RL.
4.1 SkewFit Algorithm
To learn a uniform distribution over valid goal states, we present a method that iteratively increases the entropy of a generative model ${p}_{\varphi}$. In particular, given a generative model ${p}_{{\varphi}_{t}}$ at iteration $t$, we would like to train a new generative model ${p}_{{\varphi}_{t+1}}$ such that ${p}_{{\varphi}_{t+1}}$ has higher entropy over the set of valid states. While we do not know the set of valid states $\mathcal{S}$, we can sample states from $p(\mathbf{S}\mid {p}_{{\varphi}_{t}})$, resulting in an empirical distribution ${p}_{{\text{emp}}_{t}}$ over the states
${p}_{{\text{emp}}_{t}}(\mathbf{s})={\displaystyle \frac{1}{N}}{\displaystyle \sum _{n=1}^{N}}\mathrm{\U0001d7cf}\{\mathbf{s}={\mathbf{S}}_{n}\},{\mathbf{S}}_{n}\sim p(\mathbf{S}\mid {p}_{{\varphi}_{t}}),$  (2) 
and use this empirical distribution to train the next generative model ${p}_{{\varphi}_{t+1}}$. However, if we simply train ${p}_{{\varphi}_{t+1}}$ to model this empirical distribution, it may not necessarily have higher entropy than ${p}_{{\varphi}_{t}}$.
The intuition behind our method is quite simple: rather than fitting a generative model to our empirical distribution, we skew the empirical distribution so that rarely visited states are given more weight. See Figure 2 for a visualization of this process.
How should we skew the empirical distribution if we want to maximize the entropy of ${p}_{{\varphi}_{t+1}}$? If we had access to the density of each state, ${p}_{{\text{emp}}_{t}}(\mathbf{S})$, then we could simply weight each state by $1/{p}_{{\text{emp}}_{t}}(\mathbf{S})$. We could then perform maximum likelihood estimation (MLE) for the uniform distribution by using the following loss to train ${\varphi}_{t+1}$:
$\mathcal{L}(\varphi )$  $={\mathbb{E}}_{\mathbf{S}\sim {U}_{\mathcal{S}}}\left[\mathrm{log}{p}_{\varphi}(\mathbf{S})\right]={\mathbb{E}}_{\mathbf{S}\sim {p}_{{\text{emp}}_{t}}}\left[{\displaystyle \frac{{U}_{\mathcal{S}}(\mathbf{S})}{{p}_{{\text{emp}}_{t}}(\mathbf{S})}}\mathrm{log}{p}_{\varphi}(\mathbf{S})\right]\propto {\mathbb{E}}_{\mathbf{S}\sim {p}_{{\text{emp}}_{t}}}\left[{\displaystyle \frac{1}{{p}_{{\text{emp}}_{t}}(\mathbf{S})}}\mathrm{log}{p}_{\varphi}(\mathbf{S})\right]$ 
where we use the fact that the uniform distribution ${U}_{\mathcal{S}}(\mathbf{S})$ has constant density for all states in $\mathcal{S}$. However, computing this density ${p}_{{\text{emp}}_{t}}(\mathbf{S})$ requires marginalizing out the MDP dynamics, which requires an accurate model of both the dynamics and the goalconditioned policy. We avoid needing to model the entire MDP process by approximating ${p}_{{\text{emp}}_{t}}(\mathbf{S})$ with our previous learned generative model: ${p}_{{\text{emp}}_{t}}(\mathbf{S})\approx p(\mathbf{S}\mid {p}_{{\varphi}_{t}})\approx {p}_{{\varphi}_{t}}(\mathbf{S})$.
Variance minimization.
This procedure relies on importance sampling (IS), which can have high variance, particularly if ${p}_{{\varphi}_{t}}(\mathbf{S})\approx 0$. We reduce this variance by weighing each state by ${p}_{{\varphi}_{t}}{(\mathbf{S})}^{\alpha}$, for $\alpha \in [1,0)$ rather than ${p}_{\varphi}{(\mathbf{S})}^{1}$. If $\alpha =1$, then we recover the exact importance sampling procedure described above. If $\alpha =0$, then this skew step has no effect. By choosing intermediate values of $\alpha $, we can trade off the variance introduced from dividing the weight by a potentially small ${p}_{{\varphi}_{t}}(\mathbf{S})$ with the speed at which we want to increase the entropy of the goal distribution. Moreover, we show in Section 4.2 that, for a range of values of $\alpha \in [1,0)$, this procedure will converge to the uniform distribution over valid states. Therefore, we explicitly define a skewed distribution using a weight with exponent $\alpha $:
${p}_{{\text{skewed}}_{t}}(\mathbf{s})$  $={\displaystyle \frac{1}{{Z}_{\alpha}}}{p}_{{\text{emp}}_{t}}(\mathbf{s}){p}_{{\varphi}_{t}}{(\mathbf{s})}^{\alpha},\alpha \in [1,0),{Z}_{\alpha}={\displaystyle \sum _{n=1}^{N}}{p}_{{\text{emp}}_{t}}({\mathbf{S}}_{n}){p}_{{\varphi}_{t}}{({\mathbf{S}}_{n})}^{\alpha},$  (3) 
where ${Z}_{\alpha}$ is the normalizing coefficient and ${p}_{{\text{emp}}_{t}}$ is given by Equation 2. Lastly, we fit the generative model at the next iteration ${p}_{{\varphi}_{t+1}}$ to ${p}_{{\text{skewed}}_{t}}$ using standard MLE. We note that computing ${Z}_{\alpha}$ does not add much computational overhead, since all importance weights already need to be computed.
Goal sampling alternative.
Because ${p}_{{\varphi}_{t+1}}\approx {p}_{{\text{skewed}}_{t}}$, at iteration $t+1$, one can sample goals from either ${p}_{{\varphi}_{t+1}}$ or ${p}_{{\text{skewed}}_{t}}$. Sampling goals from ${p}_{{\text{skewed}}_{t}}$ may be preferred if sampling from the learned generative model ${p}_{{\varphi}_{t+1}}$ is computationally or otherwise challenging. In either case, one still needs to train the generative model ${p}_{{\varphi}_{t}}$ to create ${p}_{{\text{skewed}}_{t}}$. In our experiments, we found that both methods perform well.
Overall, SkewFit samples data from the environment and weights different samples by their density under the generative model $p_{\varphi}{}_{t}$. We prove in the next section that this weighting makes the generative model at the next iteration $p_{\varphi}{}_{t+1}$ have higher entropy. By doing so, the generative model is more likely to generate goals at the frontier of unseen states, which results in more uniform state coverage. This procedure is shown in Figure 2 and SkewFit is summarized in Algorithm 4.1.
{algorithmic}[1] \FORIteration $t=1,2,\mathrm{\dots}$ \STATECollect $N$ states ${\{{\mathbf{S}}_{i}\}}_{i=1}^{N}$ by sampling goals from ${p}_{{\varphi}_{t}}$ (or ${p}_{{\text{skewed}}_{t}}$) and running goalconditioned policy. \STATEConstruct skewed distribution ${p}_{{\text{skewed}}_{t}}$ (Equation 3). \STATEFit ${p}_{{\varphi}_{t+1}}$ to skewed distribution ${p}_{{\text{skewed}}_{t}}$ using MLE. \ENDFOR
4.2 SkewFit Analysis
In this section, we provide conditions under which ${p}_{{\text{skewed}}_{t}}$ converges to the uniform distribution over the state space $\mathcal{S}$. Our most general result is stated as follows:
Lemma 4.1.
Let $\mathrm{S}$ be a compact set. Define the set of distributions $\mathrm{Q}\mathrm{=}\mathrm{\{}p\mathrm{:}\mathit{\text{support of}}\mathit{}p\mathit{}\mathit{\text{is}}\mathit{}\mathrm{S}\mathrm{\}}$. Let $\mathrm{F}\mathrm{:}\mathrm{Q}\mathrm{\mapsto}\mathrm{Q}$ be a continuous function and such that $\mathrm{H}\mathit{}\mathrm{(}\mathrm{F}\mathit{}\mathrm{(}p\mathrm{)}\mathrm{)}\mathrm{\ge}\mathrm{H}\mathit{}\mathrm{(}p\mathrm{)}$ with equality if and only if $p$ is the uniform probability distribution on $\mathrm{S}$, ${U}_{\mathrm{S}}$. Define the sequence of distributions $P\mathrm{=}\mathrm{(}{p}_{\mathrm{1}}\mathrm{,}{p}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{)}$ by starting with any ${p}_{\mathrm{1}}\mathrm{\in}\mathrm{Q}$ and recursively defining ${p}_{t\mathrm{+}\mathrm{1}}\mathrm{=}\mathrm{F}\mathit{}\mathrm{(}{p}_{t}\mathrm{)}$.
The sequence $P$ converges to ${U}_{\mathrm{S}}$.
Proof.
See Appendix Section A. ∎
We will apply Lemma 4.1 to be the map from ${p}_{{\text{skewed}}_{t}}$ to ${p}_{{\text{skewed}}_{t+1}}$ to show that ${p}_{{\text{skewed}}_{t}}$ converges to ${U}_{\mathcal{S}}$^{2}^{2} 2 We take $N\to \mathrm{\infty}$ and refer to convergence in distribution.. If we assume that the goalconditioned policy and generative model learning procedure are well behaved (technical assumptions are stated in B.1 of the Appendix), then to apply Lemma 4.1, we only need to show that $\mathscr{H}({p}_{{\text{skewed}}_{t}})\ge \mathscr{H}({p}_{{\text{emp}}_{t}})$ with equality if and only if ${p}_{{\text{emp}}_{t}}={U}_{\mathcal{S}}$. For the simple case when ${p}_{{\varphi}_{t}}={p}_{{\text{emp}}_{t}}$ identically at each iteration, we prove that this is true for any value of $\alpha \in [1,0)$ in Lemma B.3 of the Appendix. However, in practice, ${p}_{{\varphi}_{t}}$ only approximates ${p}_{{\text{emp}}_{t}}$. To address this more realistic situation, we prove the following result:
Lemma 4.2.
Given two distribution ${p}_{{\text{\mathit{e}\mathit{m}\mathit{p}}}_{t}}$ and ${p}_{{\varphi}_{t}}$ where ${p}_{{\text{\mathit{e}\mathit{m}\mathit{p}}}_{t}}\mathrm{\ll}{p}_{{\varphi}_{t}}$ ^{3}^{3} 3 $p\mathrm{\ll}q$ means that $p$ is absolutely continuous with respect to $q$, i.e. $p\mathit{}\mathrm{(}\mathrm{s}\mathrm{)}\mathrm{=}\mathrm{0}\mathrm{\u27f9}q\mathit{}\mathrm{(}\mathrm{s}\mathrm{)}\mathrm{=}\mathrm{0}$. and
${\mathrm{Cov}}_{\mathbf{S}\sim {p}_{{\text{\mathit{e}\mathit{m}\mathit{p}}}_{t}}}[\mathrm{log}{p}_{{\text{\mathit{e}\mathit{m}\mathit{p}}}_{t}}(\mathbf{S}),\mathrm{log}{p}_{{\varphi}_{t}}(\mathbf{S})]>0,$  (4) 
define the distribution ${p}_{{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{w}\mathit{e}\mathit{d}}}_{t}}$ as in Equation 3. Let ${\mathrm{H}}_{\alpha}\mathit{}\mathrm{(}\alpha \mathrm{)}$ be the entropy of ${p}_{{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{w}\mathit{e}\mathit{d}}}_{t}}$ for a fixed $\alpha $. Then there exists a constant $$ such that for all $\alpha \mathrm{\in}\mathrm{[}a\mathrm{,}\mathrm{0}\mathrm{)}$,
$\mathscr{H}({p}_{{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{w}\mathit{e}\mathit{d}}}_{t}})={\mathscr{H}}_{\alpha}(\alpha )>\mathscr{H}({p}_{{\text{\mathit{e}\mathit{m}\mathit{p}}}_{t}}).$ 
Proof.
See Appendix Section A. ∎
Thus, our generative model ${p}_{{\varphi}_{t}}$ does not need to exactly fit the empirical distribution. We merely need for the log densities of ${p}_{{\varphi}_{t}}$ and ${p}_{{\text{emp}}_{t}}$ to be correlated, which we expect to happen frequently with an accurate goalconditioned policy, since ${p}_{{\text{emp}}_{t}}$ is the set of states seen when trying to reach goals from ${p}_{{\varphi}_{t}}$. In this case, if we choose negative values of $\alpha $ that are small enough, then the entropy of ${p}_{{\text{skewed}}_{t}}$ will be higher than that of ${p}_{{\text{emp}}_{t}}$. Empirically, we found that $\alpha $ values as low as $\alpha =1$ performed well.
In summary, we see that under certain assumptions, ${p}_{{\text{skewed}}_{t}}$ converges to ${U}_{\mathcal{S}}$. Since we train each generative model ${p}_{{\varphi}_{t+1}}$ by fitting it to ${p}_{{\text{skewed}}_{t}}$, we expect ${p}_{{\varphi}_{t}}$ to also converge to ${U}_{\mathcal{S}}$.
5 SkewFit with GoalConditioned Reinforcement Learning
To train agents that can autonomously explore the environment in a controllable manner, we need to maximize the mutual information between goals and states, $I(\mathbf{S};\mathbf{G})=\mathscr{H}(\mathbf{G})\mathscr{H}(\mathbf{G}\mid \mathbf{S})$. Thus far, we have presented and derived SkewFit assuming that we have access to an oracle goalreaching policy, allowing us to separately analyze how we can maximize $\mathscr{H}(\mathbf{G})$. However, in practice we do not have access to such an oracle, and in this section we discuss how one can combine SkewFit with existing goalconditioned reinforcement learning to maximize both of these terms jointly.
Maximizing $I(\mathbf{S};\mathbf{G})$ can be done by simultaneously performing SkewFit and training a goal conditioned policy to minimize $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$, or, equivalently, maximize $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$. Maximizing $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$ requires computing the density $\mathrm{log}p(\mathbf{G}\mid \mathbf{S})$, which may be difficult to compute without strong modeling assumptions. However, for any distribution $q$, the following lower bound for $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$ holds:
$\mathscr{H}(\mathbf{G}\mid \mathbf{S})$  $={\mathbb{E}}_{(\mathbf{G},\mathbf{S})\sim {p}_{{\varphi}_{t}},\pi}\left[\mathrm{log}q(\mathbf{G}\mid \mathbf{S})\right]+{D}_{\mathrm{KL}}(p\mid q)\ge {\mathbb{E}}_{(\mathbf{G},\mathbf{S})\sim {p}_{{\varphi}_{t}},\pi}\left[\mathrm{log}q(\mathbf{G}\mid \mathbf{S})\right],$ 
where ${D}_{\mathrm{KL}}$ denotes Kullback–Leibler divergence as discussed by Barber & Agakov (2004). Thus, to minimize $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$, we train a policy to maximize the following reward:
$r(\mathbf{S},\mathbf{G})=\mathrm{log}q(\mathbf{G}\mid \mathbf{S}).$ 
Since SkewFit uses a generative model to propose goals, it is particularly natural to combine with reinforcement learning with imagined goals (RIG) Nair et al. (2018), though in principle any goalconditioned method could be used. RIG is an efficient offpolicy goalconditioned method that solves the visionbased RL problem in a learned latent space. In particular, RIG fits a VAE and uses it to encode all observations and goals into a latent space. RIG also uses the generative model for both goal sampling and compute rewards, $\mathrm{log}q(\mathbf{G}\mid \mathbf{S})$. Applying SkewFit to RIG then amounts to using SkewFit rather than MLE to train the VAE. See Appendix E for details on how to do so.
6 Experiments
Our experimental evaluation of SkewFit aims to study the following empirical questions: (1) Can SkewFit learn a generative model that can model the uniform distribution over the set of valid states? (2) When combined with goalconditioned RL, can SkewFit enable agents to autonomously set and learn to reach a diverse set of goals?
To answer these questions, we evaluate SkewFit with RIG on visionbased continuous control. In this setting, the agent must control a robot arm using only image observations, without access to any ground truth reward signal. We test our method on three different simulated continuous control tasks: Visual Door, Visual Pusher, and Visual Pickup. See Figure 3 for visuals and Section E for details of these environments. Training policies for these tasks is done in a completely unsupervised manner without access to either any prior information about the statespace or an oracle goalsampling distribution. However, to evaluate their performance, we sample goal images from a uniform distribution over valid images and report the agent’s final distance to the corresponding simulator states (e.g. distance of the puck to the target puck location). While this evaluation method and metric is only practical in simulation, it provides us with a quantitative metric for measuring an RL method’s ability to reach a broad coverage of goals in a visionbased setting.
To train policies in imagebased environments, for each method we compare to, with the exception of WardeFarley et al. (2018), we apply it alongside RIG in order to enable them to solve visual tasks and ensure a fair comparison across methods. Additionally, we replace TD3(Fujimoto et al., 2018) with soft actor critic (SAC) Haarnoja et al. (2018) for all the methods that use RIG including SkewFit. We found that SAC performed as well or better than TD3 in every environment. In RIG, the VAE was pretrained on a uniform sampling of images from the state space of each environment. In order to ensure a fair comparison to SkewFit, we forego pretraining and instead train the VAE alongside RL, using the variant described in the RIG paper.
First, we compare to standard RIG, without SkewFit. We ran RIG using the relabeling scheme described in the hindsight experience replay (HER) paper to form the baseline, HER. Florensa et al. (2018b) samples goals from a GAN based on the difficulty of reaching the goal. We compare against this method by replacing ${p}_{\varphi}$ with the GAN and label it AutoGoal GAN. We compare to WardeFarley et al. (2018), which uses a nonparametric approach based on clustering to sample goals and a state discriminator to compute rewards. We denote this baseline as DISCERN. We also compare to the goal proposal mechanism proposed by WardeFarley et al. (2018) without the discriminative reward in DISCERN, which we label DISCERNg. Lastly, we compare our method to two exploration methods based on reward bonuses: ICM (Pathak et al., 2017), and # Exploration (Tang et al., 2017), denoted ICM and #Exploration respectively.
We see in Figure 4 that SkewFit significantly outperforms prior methods both in terms of task performance and sample complexity. The most common failure mode for prior methods is that the goal sampling distribution collapses, resulting in the agent learning to reach only a fraction of the state space. For comparison, more goals sampled from the generative model trained with SkewFit and RIG are shown in Figure 13 in the Appendix. We can see that standard RIG produces a small nondiverse distribution for each environment: the object is in the same place for pickup, the puck is often in the starting position for pushing, and the door is always closed.
In contrast, SkewFit proposes goals where the object is in the air and on the ground, varied puck positions, and both open and closed door angles. The direct effect of these goal choices can be seen in Figure 12 in the appendix, which shows example reached goals for RIG and SkewFit. Standard RIG only learns to reach states close to the initial position, while SkewFit learns to reach the entire state space. A quantitative comparison between the two methods on the pickup task can be seen in Figure 10 in the appendix, which gives the cumulative total exploration pickups for each method. From the graph, we can see that only SkewFit learns to pay attention to the object, while other methods ignore it completely.
Realworld visionbased robotic manipulation
Finally, we demonstrate that SkewFit scales well to the real world with a door opening task, Real World Visual Door. See Figure 3 for a picture of this environment. While a number of prior works have studied RLbased learning of door opening Kalakrishnan et al. (2011); Chebotar et al. (2017), we demonstrate the first method for autonomous learning of door opening without a userprovided, taskspecific reward function. As in simulation, we do not provide any goals to the agent and simply let it interact with the door to solve the door opening task from scratch, without any human guidance or reward signal. We train agents using SkewFit as well as RIG. Unlike in simulation, we cannot measure the difference between the policy’s achieved and desired door angle since we do not have access to the true state of the world. Instead, we simply visually denote a binary success/failure for each goal based on whether the last state in the trajectory achieves the target angle. Every seven and a half minutes of interaction time we evaluate on $5$ goals and plot the cumulative successes for each method. As Figure 5 shows, standard RIG only starts to open the door consistently after five hours of training. In contrast, SkewFit learns to open the door after three hours of training and achieves a perfect success rate after five and a half hours of interaction time, demonstrating that SkewFit is a promising technique for solving real world tasks without any humanprovided reward function. Videos of our method solving this task, along with the simulated environments, can be viewed on our website.^{4}^{4} 4 https://sites.google.com/view/skewfit
Additional Experiments
We perform additional experiments including a more thorough analysis in a simple 2D setting, a study of how SkewFit can be used to train generative models on imbalanced datasets, and a sensitivity analysis over the parameter $\alpha $, in Appendix D. Additionally, Appendix E provides a complete description of all our environments as well as our method hyperparameters.
7 Conclusion
We presented SkewFit, an algorithm for training a generative model to closely approximate a uniform distribution over valid states, using data obtained via goalconditioned reinforcement learning. Our method iteratively reweights the samples for training the generative model, such that its entropy increases over the set of possible states, and our theoretical analysis gives conditions under which SkewFit converges to the uniform distribution. When such a model is used to choose goals for exploration and to relabeling goals for training, the resulting method results in much better coverage of the state space, enabling our method to explore effectively. Our experiments show that it produces quantifiable improvements when used along with goalconditioned reinforcement learning on simulated robotic manipulation tasks, and can be used to learn a complex door opening skill to reach a $100\%$ success rate directly on a real robot, without any humanprovided reward supervision.
8 Acknowledgement
This research was supported by Berkeley DeepDrive, Huawei, ARL DCIST CRA W911NF1720181, NSF IIS1651843, and the Office of Naval Research, as well as Amazon, Google, and NVIDIA. We thank Aviral Kumar, Carlos Florensa, Aurick Zhou, Nilesh Tripuraneni, Vickie Ye, Dibya Ghosh, Coline Devin, and John D. CoReyes for their insightful discussions and feedback.
References
 Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., Mcgrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.
 Baranes & Oudeyer (2012) Baranes, A. and Oudeyer, P.Y. Active Learning of Inverse Models with Intrinsically Motivated Goal Exploration in Robots. Robotics and Autonomous Systems, 61(1):49–73, 2012. doi: 10.1016/j.robot.2012.05.008.
 Barber & Agakov (2004) Barber, D. and Agakov, F. V. Information maximization in noisy channels: A variational approach. In Advances in Neural Information Processing Systems, pp. 201–208, 2004.
 Bellemare et al. (2016) Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying countbased exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), pp. 1471–1479, 2016.
 Billingsley (2013) Billingsley, P. Convergence of probability measures. John Wiley & Sons, 2013.
 Burda et al. (2018) Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
 Burda et al. (2019) Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Largescale study of curiositydriven learning. In International Conference on Learning Representations (ICLR), 2019.
 Chebotar et al. (2017) Chebotar, Y., Kalakrishnan, M., Yahya, A., Li, A., Schaal, S., and Levine, S. Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3381–3388. IEEE, 2017.
 Chentanez et al. (2005) Chentanez, N., Barto, A. G., and Singh, S. P. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 1281–1288, 2005.
 Colas et al. (2018a) Colas, C., Fournier, P., Sigaud, O., and Oudeyer, P. CURIOUS: intrinsically motivated multitask, multigoal reinforcement learning. CoRR, abs/1810.06284, 2018a.
 Colas et al. (2018b) Colas, C., Sigaud, O., and Oudeyer, P.Y. Geppg: Decoupling exploration and exploitation in deep reinforcement learning algorithms. International Conference on Machine Learning (ICML), 2018b.
 Eysenbach et al. (2019) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is All You Need: Learning Skills without a Reward Function. In International Conference on Learning Representations (ICLR), 2019.
 Florensa et al. (2017) Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.
 Florensa et al. (2018a) Florensa, C., Degrave, J., Heess, N., Springenberg, J. T., and Riedmiller, M. Selfsupervised Learning of Image Embedding for Continuous Control. In Workshop on Inference to Control at NeurIPS, 2018a.
 Florensa et al. (2018b) Florensa, C., Held, D., Geng, X., and Abbeel, P. Automatic Goal Generation for Reinforcement Learning Agents. In International Conference on Machine Learning (ICML), 2018b.
 Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing Function Approximation Error in ActorCritic Methods. In International Conference on Machine Learning (ICML), 2018.
 Gupta et al. (2018a) Gupta, A., Eysenbach, B., Finn, C., and Levine, S. Unsupervised metalearning for reinforcement learning. CoRR, abs:1806.04640, 2018a.
 Gupta et al. (2018b) Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. MetaReinforcement Learning of Structured Exploration Strategies. In Advances in Neural Information Processing Systems (NIPS), 2018b.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actorcritic algorithms and applications. CoRR, abs/1812.05905, 2018.
 Hausman et al. (2018) Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an Embedding Space for Transferable Robot Skills. In International Conference on Learning Representations (ICLR), pp. 1–16, 2018.
 Hazan et al. (2018) Hazan, E., Kakade, S. M., Singh, K., and Soest, A. V. Provably efficient maximum entropy exploration. CoRR, abs/1812.02690, 2018.
 Kaelbling (1993) Kaelbling, L. P. Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), volume vol.2, pp. 1094 – 8, 1993.
 Kalakrishnan et al. (2011) Kalakrishnan, M., Righetti, L., Pastor, P., and Schaal, S. Learning force control policies for compliant manipulation. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4639–4644. IEEE, 2011.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. AutoEncoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
 Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016. ISBN 078033213X. doi: 10.1613/jair.301.
 Lopes et al. (2012) Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.Y. Exploration in modelbased reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, pp. 206–214, 2012.
 Mohamed & Rezende (2015) Mohamed, S. and Rezende, D. J. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 2125–2133, 2015.
 Nachum et al. (2018) Nachum, O., Brain, G., Gu, S., Lee, H., and Levine, S. DataEfficient Hierarchical Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Nair et al. (2018) Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual Reinforcement Learning with Imagined Goals. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Nielsen & Nock (2010) Nielsen, F. and Nock, R. Entropies and crossentropies of exponential families. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 3621–3624. IEEE, 2010.
 Ostrovski et al. (2017) Ostrovski, G., Bellemare, M. G., Oord, A., and Munos, R. Countbased exploration with neural density models. In International Conference on Machine Learning, pp. 2721–2730, 2017.
 Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. CuriosityDriven Exploration by SelfSupervised Prediction. In International Conference on Machine Learning (ICML), pp. 488–489. IEEE, 2017.
 Péré et al. (2018) Péré, A., Forestier, S., Sigaud, O., and Oudeyer, P.Y. Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration. In International Conference on Learning Representations (ICLR), 2018.
 Pong et al. (2018) Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal Difference Models: ModelFree Deep RL For ModelBased Control. In International Conference on Learning Representations (ICLR), 2018.
 Savinov et al. (2018) Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.
 Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal Value Function Approximators. In International Conference on Machine Learning (ICML), pp. 1312–1320, 2015. ISBN 9781510810587.
 Stadie et al. (2016) Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. In International Conference on Learning Representations (ICLR), 2016.
 Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between mdps and semimdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(12):181–211, 1999.
 Tang et al. (2017) Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. #Exploration: A Study of CountBased Exploration for Deep Reinforcement Learning. In Neural Information Processing Systems (NIPS), 2017.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for modelbased control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012. ISBN 9781467317375. doi: 10.1109/IROS.2012.6386109.
 Veeriah et al. (2018) Veeriah, V., Oh, J., and Singh, S. Manygoals reinforcement learning. arXiv preprint arXiv:1806.09605, 2018.
 WardeFarley et al. (2018) WardeFarley, D., de Wiele, T. V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through nonparametric discriminative rewards. CoRR, abs/1811.11359, 2018.
 Zhao & Tresp (2019) Zhao, R. and Tresp, V. Curiositydriven experience prioritization via density estimation. CoRR, abs/1902.08039, 2019.
Appendix A GoalConditioned RL minimizes $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$
Some goalconditioned RL methods such as WardeFarley et al. (2018); Nair et al. (2018) present methods for minimizing a lower bound for $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$, by approximating $\mathrm{log}p(\mathbf{G}\mid \mathbf{S})$ and using it as the reward. While, other goalconditioned RL methods (Kaelbling, 1993; Lillicrap et al., 2016; Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018; Florensa et al., 2018a) are not developed with the intention of minimizing the conditional entropy $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$, the optimal goalconditioned policy will deterministically reach the goal, resulting in a conditional entropy of zero: $\mathscr{H}(\mathbf{G}\mid \mathbf{S})=0$. Thus, goalconditioned RL methods effectively minimize $\mathscr{H}(\mathbf{G}\mid \mathbf{S})$.
Appendix B Analysis
We note several intermediate results that are necessary in the proof of SkewFit.
B.1 Proofs of Lemmas for the Analysis of SkewFit
Let $q\ll p$ means that $q$ is absolutely continuous with respect to $p$, i.e. $p(x)=0\u27f9q(x)=0$.
Assumption B.1.
Assume that

•
$\mathcal{S}$ is compact.

•
The maps from ${p}_{{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{w}\mathit{e}\mathit{d}}}_{t}}$ to ${p}_{{\varphi}_{t+1}}$ and ${p}_{{\varphi}_{t}}$ to ${p}_{{\text{\mathit{e}\mathit{m}\mathit{p}}}_{t}}$ are continuous and do not decrease entropy.

•
The support of ${p}_{{\varphi}_{t}}$ contains $\mathcal{S}$.
The assumption that $\mathcal{S}$ is compact is easily achieved in most application and makes ${U}_{\mathcal{S}}$ well defined.
The continuity assumption states that the method used to fit ${p}_{{\varphi}_{t+1}}$ and the goalconditioned policy are wellbehaved.
Note that we do not assume that these procedures for obtain ${p}_{{\varphi}_{t+1}}$ and ${p}_{{\text{emp}}_{t+1}}$ increase the entropy, and only require that they do not decrease the entropy. Making a statement without these entropy assumption would be difficult: if the goalconditioned policy ignored the goal and always entered the same state, or if the generative model only captured a single mode ${p}_{{\text{skewed}}_{t}}$, it would be challenging for any procedure to result in the uniform distribution.
Next, the assumption that the support of ${p}_{{\varphi}_{t}}$ contains $\mathcal{S}$ ensures that ${p}_{{\text{skewed}}_{t}}$ is well defined and a continuous function of ${p}_{{\text{emp}}_{t}}$ and ${p}_{{\varphi}_{t}}$. Note that ${p}_{{\varphi}_{t}}$ can have support larger than $\mathcal{S}$, and so we can choose to optimize ${p}_{{\varphi}_{t}}$ over any class of generative models with wide supports, without needing to know the manifold $\mathcal{S}$.
Under these assumptions, to use Lemma 4.1 for SkewFit, we must show $\mathscr{H}({p}_{{\text{skewed}}_{t}})\ge \mathscr{H}({p}_{{\text{emp}}_{t}})$ with equality if and only if ${p}_{{\text{emp}}_{t}}={U}_{\mathcal{S}}$.
Lemma B.2.
Let $\mathrm{S}$ be a compact set. Define the set of distributions $\mathrm{Q}\mathrm{=}\mathrm{\{}p\mathrm{:}\mathit{\text{support of}}\mathit{}p\mathit{}\mathit{\text{is}}\mathit{}\mathrm{S}\mathrm{\}}$. Let $\mathrm{F}\mathrm{:}\mathrm{Q}\mathrm{\mapsto}\mathrm{Q}$ be a continuous function and such that $\mathrm{H}\mathit{}\mathrm{(}\mathrm{F}\mathit{}\mathrm{(}p\mathrm{)}\mathrm{)}\mathrm{\ge}\mathrm{H}\mathit{}\mathrm{(}p\mathrm{)}$ with equality if and only if $p$ is the uniform probability distribution on $\mathrm{S}$, ${U}_{\mathrm{S}}$. Define the sequence of distributions $P\mathrm{=}\mathrm{(}{p}_{\mathrm{1}}\mathrm{,}{p}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{)}$ by starting with any ${p}_{\mathrm{1}}\mathrm{\in}\mathrm{Q}$ and recursively defining ${p}_{t\mathrm{+}\mathrm{1}}\mathrm{=}\mathrm{F}\mathit{}\mathrm{(}{p}_{t}\mathrm{)}$.
The sequence $P$ converges to ${U}_{\mathrm{S}}$.
Proof.
The uniform distribution ${U}_{\mathcal{S}}$ is well defined since $\mathcal{S}$ is compact. Because $\mathcal{S}$ is a compact set, by Prokhorov’s Theorem Billingsley (2013), the set $\mathcal{Q}$ is sequentially compact. Thus, $P$ has a convergent subsequence ${P}^{\prime}=({p}_{{k}_{1}},{p}_{{k}_{2}},\mathrm{\dots})\subset P$ for $$ that converges to a distribution ${p}^{*}\in \mathcal{Q}$. Because $\mathcal{F}$ is continuous, ${p}^{*}$ must be a fixed point of $\mathcal{F}$ since by the convergence mapping theorem, we have that
$\underset{i\to \mathrm{\infty}}{lim}{p}_{{k}_{i}}={p}^{*}$  $\u27f9\underset{i\to \mathrm{\infty}}{lim}\mathcal{F}({p}_{{k}_{i}})=\mathscr{H}({p}^{*})$ 
and so
${p}^{*}$  $=\underset{i\to \mathrm{\infty}}{lim}{p}_{{k}_{i}}$  
$=\underset{i\to \mathrm{\infty}}{lim}\mathcal{F}({p}_{{k}_{i1}})$  
$=\mathscr{H}({p}^{*}).$ 
The only fixed point of $\mathcal{F}$ is ${U}_{\mathcal{S}}$ since for any distribution $p$ that is not the uniform distribution, ${U}_{\mathcal{S}}$, we have that $\mathscr{H}(\mathcal{F}(p))>\mathscr{H}(p)$ which implies that $\mathcal{F}(p)\ne p$. Thus, ${P}^{\prime}$ converges to the only fixed point, ${U}_{\mathcal{S}}$. Since the entropy cannot decrease, then entropy of the distributions in $P$ must also converge to the entropy of ${U}_{\mathcal{S}}$. Lastly, since entropy is a continuous function of distribution, $P$ must converge to ${U}_{\mathcal{S}}$. ∎
Lemma B.3.
Assume the set $\mathrm{S}$ has finite volume so that its uniform distribution ${U}_{\mathrm{S}}$ is well defined and has finite entropy. Given any distribution $p\mathit{}\mathrm{(}\mathrm{s}\mathrm{)}$ whose support is $\mathrm{S}$, recursively define ${p}_{\alpha}$
${p}_{\alpha}(\mathbf{s})$  $={\displaystyle \frac{1}{{Z}_{\alpha}}}p{(\mathbf{s})}^{\alpha},\forall \mathbf{s}\in \mathcal{S}$ 
where ${Z}_{\alpha}$ is the normalizing constant and $\alpha \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$^{5}^{5} 5 In the paper, $\alpha \mathrm{\in}\mathrm{[}\mathrm{}\mathrm{1}\mathrm{,}\mathrm{0}\mathrm{)}$. However, when ${p}_{{\mathrm{\text{emp}}}_{t}}\mathrm{=}{p}_{{\varphi}_{t}}$, Equation 3 becomes $p_{\mathrm{\text{skewed}}}{}_{t}\mathit{}\mathrm{(}\mathrm{S}\mathrm{)}$ $\mathrm{=}{\displaystyle \frac{\mathrm{1}}{{Z}_{\alpha}}}\mathit{}{p}_{{\varphi}_{t}}\mathit{}\mathrm{(}\mathrm{S}\mathrm{)}\mathit{}{p}_{{\varphi}_{t}}\mathit{}{\mathrm{(}\mathrm{S}\mathrm{)}}^{\alpha}\mathrm{,}\alpha \mathrm{\in}\mathrm{[}\mathrm{}\mathrm{1}\mathrm{,}\mathrm{0}\mathrm{)}$ $\mathrm{=}{\displaystyle \frac{\mathrm{1}}{{Z}_{\alpha}}}\mathit{}{p}_{{\varphi}_{t}}\mathit{}{\mathrm{(}\mathrm{S}\mathrm{)}}^{{\alpha}^{\mathrm{\prime}}}\mathrm{,}{\alpha}^{\mathrm{\prime}}\mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$ .
For all $\alpha \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$,
$\mathscr{H}({p}_{\alpha})\ge \mathscr{H}(p)$ 
with equality if and only if $p$ is ${U}_{\mathrm{S}}$, the uniform distribution $\mathrm{S}$.
Proof.
If $\alpha =0$ or $p$ is the uniform distribution, the result is clearly true. We now study the case where $\alpha \in (0,1)$ and $p\ne {U}_{\mathcal{A}}$.
Define the onedimensional exponential family $\{{p}_{\theta}^{t}:\alpha \in [0,1]\}$ where ${p}_{\theta}^{t}$ is
${p}_{\theta}^{t}(\mathbf{s})={e}^{\alpha T(\mathbf{s})A(\alpha )+k(\mathbf{s})}$ 
with log carrier density $k(\mathbf{s})=0$, natural parameter $\alpha $, sufficient statistic $T(\mathbf{s})=\mathrm{log}{p}_{t}(\mathbf{s})$, and lognormalizer $A(\alpha )={\int}_{\mathcal{S}}{e}^{\alpha T(\mathbf{s})}\mathit{d}\mathbf{s}$. As shown in Nielsen & Nock (2010), the entropy of a distribution from a onedimensional exponential family with parameter $\alpha $ is given by:
${\mathscr{H}}_{\theta}^{t}(\alpha )\triangleq \mathscr{H}({p}_{\theta}^{t})=A(\alpha )\alpha {A}^{\prime}(\alpha )$ 
The derivative with respect to $\alpha $ is then
$\frac{d}{d\alpha}}d{\mathscr{H}}_{\theta}^{t}(\alpha )$  $=\alpha {A}^{\prime \prime}(\alpha )$  
$=\alpha {\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[T(\mathbf{s})]$  
$=\alpha {\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[\mathrm{log}{p}_{t}(\mathbf{s})]$  
$\le 0$ 
where we use the fact that the $n$th derivative of $A(\alpha )$ is the $n$ central moment, i.e. ${A}^{\prime \prime}(\alpha )={\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[T(\mathbf{s})]$. Since variance is always nonnegative, this means the entropy is monotonically decreasing with $\alpha $, and so
$\mathscr{H}({p}_{\alpha})\ge \mathscr{H}({p}_{1})=\mathscr{H}(p)$ 
with equality if and only if
${\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[\mathrm{log}p(\mathbf{s})]=0.$ 
However, this only happens if $\mathrm{log}p(\mathbf{s})$ is constant over its support, i.e. it is the uniform distribution over its support. ∎
We also prove the convergence directly for the (even more) simplified case when ${p}_{{\text{skewed}}_{t}}={p}_{{\varphi}_{t+1}}={p}_{{\text{emp}}_{t+1}}$ using a similar technique:
Lemma B.4.
Assume the set $\mathrm{S}$ has finite volume so that its uniform distribution ${U}_{\mathrm{S}}$ is well defined and has finite entropy. Given any distribution $p\mathit{}\mathrm{(}\mathrm{s}\mathrm{)}$ whose support is $\mathrm{S}$, recursively define ${p}_{t}$ with ${p}_{\mathrm{1}}\mathrm{=}p$ and
${p}_{t+1}(\mathbf{s})$  $={\displaystyle \frac{1}{{Z}_{\alpha}^{t}}}{p}_{t}{(\mathbf{s})}^{\alpha},\forall \mathbf{s}\in \mathcal{S}$ 
where ${Z}_{\alpha}^{t}$ is the normalizing constant and $\alpha \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$.
The sequence $\mathrm{(}{p}_{\mathrm{1}}\mathrm{,}{p}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{)}$ converges to ${U}_{\mathrm{S}}$, the uniform distribution $\mathrm{S}$.
Proof.
If $\alpha =0$, then ${p}_{2}$ (and all subsequent distributions) will clearly be the uniform distribution. We now study the case where $\alpha \in (0,1)$.
At each iteration $t$, define the onedimensional exponential family $\{{p}_{\theta}^{t}:\theta \in [0,1]\}$ where ${p}_{\theta}^{t}$ is
${p}_{\theta}^{t}(\mathbf{s})={e}^{\theta T(\mathbf{s})A(\theta )+k(\mathbf{s})}$ 
with log carrier density $k(\mathbf{s})=0$, natural parameter $\theta $, sufficient statistic $T(\mathbf{s})=\mathrm{log}{p}_{t}(\mathbf{s})$, and lognormalizer $A(\theta )={\int}_{\mathcal{S}}{e}^{\theta T(\mathbf{s})}\mathit{d}\mathbf{s}$. As shown in Nielsen & Nock (2010), the entropy of a distribution from a onedimensional exponential family with parameter $\theta $ is given by:
${\mathscr{H}}_{\theta}^{t}(\theta )\triangleq \mathscr{H}({p}_{\theta}^{t})=A(\theta )\theta {A}^{\prime}(\theta )$ 
The derivative with respect to $\theta $ is then
$\frac{d}{d\theta}}d{\mathscr{H}}_{\theta}^{t}(\theta )$  $=\theta {A}^{\prime \prime}(\theta )$  
$=\theta {\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[T(\mathbf{s})]$  
$=\theta {\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[\mathrm{log}{p}_{t}(\mathbf{s})]$  (5)  
$\le 0$ 
where we use the fact that the $n$th derivative of $A(\theta )$ is the $n$ central moment, i.e. ${A}^{\prime \prime}(\theta )={\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[T(\mathbf{s})]$. Since variance is always nonnegative, this means the entropy is monotonically decreasing with $\theta $. Note that ${p}_{t+1}$ is a member of this exponential family, with parameter $\theta =\alpha \in (0,1)$. So
$\mathscr{H}({p}_{t+1})={\mathscr{H}}_{\theta}^{t}(\alpha )\ge {\mathscr{H}}_{\theta}^{t}(1)=\mathscr{H}({p}_{t})$ 
which implies
$\mathscr{H}({p}_{1})\le \mathscr{H}({p}_{2})\le \mathrm{\dots}.$ 
This monotonically increasing sequence is upper bounded by the entropy of the uniform distribution, and so this sequence must converge.
The sequence can only converge if $\frac{d}{d\theta}{\mathscr{H}}_{\theta}^{t}(\theta )$ converges to zero. However, because $\alpha $ is bounded away from $0$, Equation 5 states that this can only happen if
${\mathrm{Var}}_{\mathbf{s}\sim {p}_{\theta}^{t}}[\mathrm{log}{p}_{t}(\mathbf{s})]\to 0.$  (6) 
Because ${p}_{t}$ has full support, then so does ${p}_{\theta}^{t}$. Thus, Equation 6 is only true if $\mathrm{log}{p}_{t}(\mathbf{s})$ converges to a constant, i.e. ${p}_{t}$ converges to the uniform distribution. ∎
Lemma B.5.
Given two distribution $p\mathit{}\mathrm{(}x\mathrm{)}$ and $q\mathit{}\mathrm{(}x\mathrm{)}$ where $p\mathrm{\ll}q$ and
$$  (7) 
define the distribution ${p}_{\alpha}$ as
${p}_{\alpha}(x)$  $={\displaystyle \frac{1}{{Z}_{\alpha}}}p(x)q{(x)}^{\alpha}$ 
where $\alpha \mathrm{\in}\mathrm{R}$ and ${Z}_{\alpha}$ is the normalizing factor. Let ${\mathrm{H}}_{\alpha}\mathit{}\mathrm{(}\alpha \mathrm{)}$ be the entropy of ${p}_{\alpha}$. Then there exists a constant $a\mathrm{>}\mathrm{0}$ such that for all $\alpha \mathrm{\in}\mathrm{[}\mathrm{}a\mathrm{,}\mathrm{0}\mathrm{)}$,
${\mathscr{H}}_{\alpha}(\alpha )>{\mathscr{H}}_{\alpha}(0)=\mathscr{H}(p).$  (8) 
Proof.
Observe that $\{{p}_{\alpha}:\alpha \in [1,0]\}$ is a onedimensional exponential family
${p}_{\alpha}(x)={e}^{\alpha T(x)A(\alpha )+k(x)}$ 
with log carrier density $k(x)=\mathrm{log}p(x)$, natural parameter $\alpha $, sufficient statistic $T(x)=\mathrm{log}q(x)$, and lognormalizer $A(\alpha )={\int}_{\mathcal{X}}{e}^{\alpha T(x)+k(x)}\mathit{d}x$. As shown in Nielsen & Nock (2010), the entropy of a distribution from a onedimensional exponential family with parameter $\alpha $ is given by:
${\mathscr{H}}_{\alpha}(\alpha )\triangleq \mathscr{H}({p}_{\alpha})=A(\alpha )\alpha {A}^{\prime}(\alpha ){\mathbb{E}}_{{p}_{\alpha}}[k(X)]$ 
The derivative with respect to $\alpha $ is then
$\frac{d}{d\alpha}}{\mathscr{H}}_{\alpha}(\alpha )$  $=\alpha {A}^{\prime \prime}(\alpha ){\displaystyle \frac{d}{d\alpha}}{\mathbb{E}}_{{p}_{\alpha}}[k(x)]$  
$=\alpha {A}^{\prime \prime}(\alpha ){\mathbb{E}}_{\alpha}[k(x)(T(x){A}^{\prime}(\alpha )]$  
$=\alpha {\mathrm{Var}}_{{p}_{\alpha}}[T(x)]{\mathrm{Cov}}_{{p}_{\alpha}}[k(x),T(x)]$ 
where we use the fact that the $n$th derivative of $A(\alpha )$ give the $n$ central moment, i.e. ${A}^{\prime}(\alpha )={\mathbb{E}}_{{p}_{\alpha}}[T(x)]$ and ${A}^{\prime \prime}(\alpha )={\mathrm{Var}}_{{p}_{\alpha}}[T(x)]$. The derivative of $\alpha =0$ is
$\frac{d}{d\alpha}}{\mathscr{H}}_{\alpha}(0)$  $={\mathrm{Cov}}_{{p}_{0}}[k(x),T(x)]$  
$={\mathrm{Cov}}_{p}[\mathrm{log}p(x),\mathrm{log}q(x)]$ 
which is negative by assumption. Because the derivative at $\alpha =0$ is negative, then there exists a constant $a>0$ such that for all $\alpha \in [a,0]$, ${\mathscr{H}}_{\alpha}(\alpha )>{\mathscr{H}}_{\alpha}(0)=\mathscr{H}(p)$. ∎
Appendix C Environment Details
PointMass: In this environment, an agent must learn to navigate a squareshaped corridor (see Figure 6). The observation is the 2D position, and the agent must specify a velocity as the 2D action. The reward at each time step is the negative distance between the achieved position and desired position.
Visual Pusher: A MuJoCo environment with a 7DoF Sawyer arm and a small puck on a table that the arm must push to a target position. The agent controls the arm by commanding $x,y$ position for the end effector (EE). The underlying state is the EE position, $e$ and puck position $p$. The evaluation metric is the distance between the goal and final puck positions. The hand goal/state space is a 10x10 cm${}^{2}$ box and the puck goal/state space is a 30x20 cm${}^{2}$ box. Both the hand and puck spaces are centered around the origin. The action space ranges in the interval $[1,1]$ in the x and y dimensions.
Visual Door: A MuJoCo environment with a 7DoF Sawyer arm and a door on a table that the arm must pull open to a target angle. Control is the same as in Visual Pusher. The evaluation metric is the distance between the goal and final door angle, measured in radians. In this environment, we do not reset the position of the hand or door at the end of each trajectory. The state/goal space is a 5x20x15 cm${}^{3}$ box in the $x,y,z$ dimension respectively for the arm and an angle between $[0,.83]$ radians. The action space ranges in the interval $[1,1]$ in the x, y and z dimensions.
Visual Pickup: A MuJoCo environment with the same robot as Visual Pusher, but now with a different object. The object is cubeshaped, but a larger intangible sphere is overlaid on top so that it is easier for the agent to see. Moreover, the robot is constrained to move in 2 dimension: it only controls the $y,z$ arm positions. The $x$ position of both the arm and the object is fixed. The evaluation metric is the distance between the goal and final object position. For the purpose of evaluation, $75\%$ of the goals have the object in the air and $25\%$ have the object on the ground. The state/goal space for both the object and the arm is 10cm in the $y$ dimension and 13cm in the $z$ dimension. The action space ranges in the interval $[1,1]$ in the $y$ and $z$ dimensions.
Real World Visual Door: A Rethink Sawyer Robot with a door on a table. The arm must pull the door open to a target angle. The agent controls the arm by commanding the $x,y,z$ velocity of the EE. Our controller commands actions at a rate of up to 10Hz with the scale of actions ranging up to 1cm in magnitude. The underlying state and goal is the same as in Visual Door. Again we do not reset the position of the hand or door at the end of each trajectory. We obtain images using a Kinect Sensor and our robotic control code can be found at https://github.com/mdalal2020/sawyer_control.git The state/goal space for the environment is a 10x10x10 cm${}^{3}$ box. The action space ranges in the interval $[1,1]$ (in cm) in the x, y and z dimensions. The door angle lies in the range $[0,45]$ degrees.
Appendix D Additional Experiments
Section D.1 studies the analyzes how well SkewFit can learn a maximum entropy goal distribution in isolation of a perfect goal reacher. We study this question in the context of a simple 2dimensional navigation environment. Section D.3 analyzes the performance of SkewFit when combined with RIG on a variety of simulated and real visionbased robot manipulation tasks.
D.1 SkewFit on Simplified RL Environment
We analyze the effect of SkewFit for learning a goal distribution in isolation of learning a goal reacher by studying an idealized example where the policy is a nearperfect goalreaching policy.
The MDP is defined on a 2by2 unit squareshaped corridor (see Figure 6). At the beginning of an episode, the agent begins in the bottomleft corner and samples a goal from a goal distribution ${p}_{{\varphi}_{t}}$. The policy reaches the state that is closest to this goal and inside the corridor, giving us a state $\mathbf{S}$ to add to our empirical distribution. We compare SkewFit to a goal sampling distribution that is only trained using maximum likelihood estimation (MLE).
As seen in Figure 6, SkewFit results in learning a high entropy, nearuniform distribution over the state space. In contrast, MLE only models the states that are explored by the initial noise of the policy, resulting in the policy only setting goals in and exploring the bottom left corner. These results empirically validate that naively using previous experience to set goals will not result in diverse exploration and that SkewFit results in a maximumentropy goalsampling distribution.
D.2 Modeling Visual Observations
We would like to use SkewFit to learn maximumentropy distributions over complex, highdimensional state spaces, where we cannot manually design these uniform distributions. The next set of experiments study how SkewFit can be used to train a generative model to sample diverse images when trained on an imbalanced dataset. For these experiments, we use a simulated MuJoCo (Todorov et al., 2012) environment and real environment that each consists of a 7 degree of freedom robot arm in front of a door that it can open. See Figure 3 for a visualization of the simulated and realworld door environment, and the Appendix for more details on the environment.
We generate a dataset of images from the environment by running a policy that samples actions uniformly at random. Such a policy represents a particularly challenging setting for standard VAE training methods: a policy that chooses random actions does not visit uniformly random states, but rather states that are heavily biased toward ones that resemble the initial state. In the door opening environment, this means that many of the samples have the door in the initial closed position, and only the robot’s arm moves. We then train two generative models on these datasets: one using SkewFit and another using MLE. For our generative model, we use the same generative model as the one in RIG, a variational autoencoder Kingma & Welling (2014). To estimate the likelihood of our data, we use Monte Carlo estimation and importance sampling to marginalize the latent variables. See Appendix Section E.2 for experimental details.
In this experiment, we do not train a goalconditioned policy, and instead only study how SkewFit can be used to effectively “balance” this dataset. As can be seen in Figure 7 and Figure 1, the samples produced by the model trained with SkewFit generally have a much wider range of door angles, while the model trained with MLE only captures a single mode of the door, both for simulated and realworld images.
D.3 SkewFit with Learned Policies
2D navigation
We reproduce the 2D navigation environment experiment from Section D.1, and replace the oracle goalreacher with a goalreaching policy that is simultaneously trained with the goal setter. The policy outputs velocities with maximum speed of one. Evaluation goals are chosen uniformly over the valid states. In Figure 7(a), we can see that a policy trained with a goal distribution trained by SkewFit consistently learns to reach all goals, whereas a goal distribution trained with MLE results in a policy that fails to reach states far from the starting position.
Sensitivity Analysis
We study the sensitvity of the $\alpha $ hyperparameter by testing values of $\alpha \in [1,0.75,0.5,0.25,0]$ on the Visual Door and Visual Pusher task. The results are included in Figure 11 and shows that our method is relatively robust to different parameters of $\alpha $, particularly for the more challenging Visual Pusher task.
Prioritization Ablation
We test SkewFit against a concurrent method Zhao & Tresp (2019) that suggests a different strategy in which ${p}_{{\text{skewed}}_{t}}(\mathbf{s})=\frac{1}{{Z}_{\alpha}}{p}_{{\text{emp}}_{t}}(\mathbf{s}){p}_{{\varphi}_{t}}{(\mathbf{s})}^{\alpha},\alpha \in [1,0)$ is replaced with ${p}_{{\text{skewed}}_{t}}(\mathbf{s})=\frac{1}{Z}{p}_{{\text{emp}}_{t}}(\mathbf{s})rank(1{p}_{{\varphi}_{t}}(\mathbf{s}))$. The results are shown in Figure 9. Besides the Visual Door Opening Environment, which is the easiest of three simulated tasks, our method performs significantly better than rankbased weighting, illustrating the effectiveness of SkewFit.
Appendix E Implementation Details
E.1 2D Navigation Experiments
We initialize the VAE to only output points in the bottom left corner of the environment. Both the encoder and decoder have ReLU hidden activations, 2 hidden layers with 32 units, and no output activations. The VAE has a latent dimension of $16$ and a Gaussian decoder trained with meansquared error loss, batch size of $500$, and $100$ epochs per iteration. For SkewFit hyperparameters, $\alpha =0.5$ and $N=10000$.
For the RL version of this task, the VAE was trained in the same way. The RL hyperparameters are listed in Table 1.
E.2 VisionBased Continuous Control Experiments
For our underlying RL algorithm, we use a modified version of soft actor critic (SAC) with automated entropy tuning Haarnoja et al. (2018) and twin Qfunctions Fujimoto et al. (2018). This is in contrast to the original RIG Nair et al. (2018) paper which used TD3 Fujimoto et al. (2018). We found that maximum entropy policies in general improved the performance of RIG, and that we did not need to add noise on top of the stochastic policy’s noise. For our RL network architectures and training scheme, we use fully connected networks for the policy, Qfunction and value networks with two hidden layers of size $400$ and $300$ each. We also delay training any of these networks for $10000$ time steps in order to collect sufficient data for the replay buffer as well as to ensure the latent space of the VAE is relatively stable (since we train the VAE online in this setting). As in RIG, we train a goalconditioned value functions Schaul et al. (2015) using hindsight experience replay Andrychowicz et al. (2017), relabelling $50\%$ of exploration goals as goals sampled from the VAE prior $\mathcal{N}(0,1)$ and $30\%$ from future goals in the trajectory.
In our experiments, we use an image size of 48x48. For our VAE architecture, we use a modified version of the architecture used in the original RIG paper Nair et al. (2018). Our VAE has three convolutional layers with kernel sizes: 5x5, 3x3, and 3x3, number of output filters: 16, 32, and 64 and strides: 3, 2, and 2. We then have a fully connected layer with the latent dimension number of units, and then reverse the architecture with deconvolution layers. We vary the latent dimension of the VAE, the $\beta $ term of the VAE and the $\alpha $ term for SkewFit based on the environment. Additionally, we vary the training schedule of the VAE based on the environment. See the table at the end of the appendix for more details. Our VAE has a Gaussian decoder with identity variance, meaning that we train the decoder with a meansquared error loss.
We estimate the density under the VAE by using a samplewise approximation to the marginal over $x$ estimated using importance sampling:
${p}_{{\varphi}_{t}}(x)$  $={\mathbb{E}}_{z\sim {q}_{{\theta}_{t}}(zx)}\left[{\displaystyle \frac{p(z)}{{q}_{{\theta}_{t}}(zx)}}{p}_{{\psi}_{t}}(x\mid z)\right]$  
$\approx {\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}\left[{\displaystyle \frac{p(z)}{{q}_{{\theta}_{t}}(zx)}}{p}_{{\psi}_{t}}(x\mid z)\right].$ 
where ${q}_{\theta}$ is the encoder, ${p}_{\psi}$ is the decoder, and $p(z)$ is the prior, which in this case is unit Gaussian. In practice we found that sampling $N=10$ latents for estimating the density to work well in practice.
When training the VAE alongside RL, we found the following two schedules to be effective for different environments:

1.
For first $5K$ steps: Train VAE using standard MLE training every $500$ time steps for $1000$ batches. After that, train VAE using SkewFit every $500$ time steps for $200$ batches.

2.
For first $5K$ steps: Train VAE using standard MLE training every $500$ time steps for $1000$ batches. For the next $45K$ steps, train VAE using SkewFit every $500$ steps for $200$ batches. After that, train VAE using SkewFit every $1000$ time steps for $200$ batches.
We found that initially training the VAE without SkewFit improved the stability of the algorithm. This is due to the fact that density estimates under the VAE are constantly changing and inaccurate during the early phases of training. Therefore, it made little sense to use those estimates to prioritize goals early on in training. Instead, we simply train using MLE training for the first $5K$ timesteps, and after that we perform SkewFit according to the VAE schedules above. Table 2 lists the hyperparameters that were shared across the continuous control experiments. Table 3 lists hyperparameters specific to each environment. Additionally, subsection E.2 shows the combined RIG + SkewFit algorithm.
Hyperparameter  Value 

Algorithm  TD3 Fujimoto et al. (2018)^{6}^{6} 6 We expect similar performance had we used SAC. 
# training batches per time step  $1$ 
Q network hidden sizes  $400,300$ 
Policy network hidden sizes  $400,300$ 
Q network and policy activation  ReLU 
Exploration Noise  None 
RL Batch Size  $1024$ 
Discount Factor  $0.99$ 
Path length  $25$ 
Reward Scaling  $100$ 
Number of steps per epoch  $5000$ 
Hyperparameter  Value  Comments 

# training batches per time step  $2$  Marginal improvements after $2$ 
Exploration Noise  None (SAC policy is stochastic)  Did not tune 
RL Batch Size  $1024$  smaller batch sizes work as well 
VAE Batch Size  $64$  Did not tune 
Discount Factor  $0.99$  Did not tune 
Reward Scaling  $1$  Did not tune 
Path length  $100$  Did not tune 
Number of Latents for Estimating Density ($N$)  $10$  Marginal improvements beyond $10$ 
Hyperparameter  Visual Pusher  Visual Door  Visual Pickup  Real World Visual Door 

Path Length  $50$  $100$  $50$  $100$ 
$\beta $ for $\beta $VAE  $20$  $20$  $30$  $60$ 
Latent Dimension Size  $4$  $16$  $16$  $16$ 
$\alpha $ for SkewFit  $1$  $1/2$  $1$  $1/2$ 
VAE Training Schedule  $2$  $1$  $2$  $1$ 
Sample Goals From  ${p}_{\varphi}$  ${p}_{\text{skewed}}$  ${p}_{\text{skewed}}$  ${p}_{\text{skewed}}$ 
{algorithmic}[1] \REQUIREVAE encoder ${q}_{\varphi}$, VAE decoder ${p}_{\psi}$, policy ${\pi}_{\theta}$, goalconditioned value function ${Q}_{w}$, $\alpha $, VAE Training Schedule. \FOR$n=0,\mathrm{\dots},N1$ episodes \STATESample latent goal ${z}_{g}$ from ${p}_{\varphi}$ which is the prior $p(z)$ or ${p}_{\text{skewed}}$. \STATESample initial state ${s}_{0}\sim E$. \FOR$t=0,\mathrm{\dots},H1$ steps \STATEGet action ${a}_{t}\sim {\pi}_{\theta}(e({s}_{t}),{z}_{g})$. \STATEGet next state ${s}_{t+1}\sim p(\cdot \mid {s}_{t},{a}_{t})$. \STATEStore $({s}_{t},{a}_{t},{s}_{t+1},{z}_{g})$ into replay buffer $\mathcal{R}$. \STATESample transition $(s,a,{s}^{\prime},{z}_{g})\sim \mathcal{R}$. \STATEEncode $z=e(s),{z}^{\prime}=e({s}^{\prime})$. \STATE(Probability $0.5$) replace ${z}_{g}$ with ${z}_{g}^{\prime}\sim p(z)$. \STATECompute new reward $r={z}^{\prime}{z}_{g}$. \STATEMinimize Bellman Error using $(z,a,{z}^{\prime},{z}_{g},r)$. \ENDFOR\FOR$t=0,\mathrm{\dots},H1$ steps \FOR$i=0,\mathrm{\dots},k1$ steps \STATESample future state ${s}_{{h}_{i}}$, $$. \STATEStore $({s}_{t},{a}_{t},{s}_{t+1},e({s}_{{h}_{i}}))$ into $\mathcal{R}$. \ENDFOR\ENDFOR\STATEConstruct skewed distribution ${p}_{{\text{skewed}}_{t}}$ using data from $\mathcal{R}$ using Equation 3 \IF$$ \STATETrain $\beta $VAE on data sampled uniformly from $\mathcal{R}$ according to VAE Training Schedule \ELSE\STATETrain $\beta $VAE on data sampled from $\mathcal{R}$ using ${p}_{\text{skewed}}$ according to VAE Training Schedule \ENDIF\ENDFOR