Abstract
Deep reinforcement learning encompasses many versatile tools for designinglearning agents that can perform well on a variety of highdimensional visualtasks, ranging from video games to robotic manipulation. However, these methodstypically suffer from poor sample efficiency, partially because they strive tobe largely problemagnostic. In this work, we demonstrate the utility of adifferent approach that is extremely sample efficient. Specifically, we proposethe Hypothesis Proposal and Evaluation (HyPE) algorithm, which utilizes a smallset of intuitive assumptions about the behavior of objects in the physicalworld (or in games that mimic physics) to automatically define and learnhierarchical skills in a highly efficient manner. HyPE does this by discoveringobjects from raw pixel data, generating hypotheses about the controllability ofobserved changes in object state, and learning a hierarchy of skills that cantest these hypotheses and control increasingly complex interactions withobjects. We demonstrate that HyPE can dramatically improve sample efficiencywhen learning a highquality pixelstoactions policy; in the popular benchmarktask, Breakout, HyPE learns an order of magnitude faster than common baselinereinforcement learning and evolutionary strategies for policy learning.
Quick Read (beta)
HypothesisDriven Skill Discovery for
Hierarchical Deep Reinforcement Learning
Abstract
Deep reinforcement learning encompasses many versatile tools for designing learning agents that can perform well on a variety of highdimensional visual tasks, ranging from video games to robotic manipulation. However, these methods typically suffer from poor sample efficiency, partially because they strive to be largely problemagnostic. In this work, we demonstrate the utility of a different approach that is extremely sample efficient. Specifically, we propose the Hypothesis Proposal and Evaluation (HyPE) algorithm, which utilizes a small set of intuitive assumptions about the behavior of objects in the physical world (or in games that mimic physics) to automatically define and learn hierarchical skills in a highly efficient manner. HyPE does this by discovering objects from raw pixel data, generating hypotheses about the controllability of observed changes in object state, and learning a hierarchy of skills that can test these hypotheses and control increasingly complex interactions with objects. We demonstrate that HyPE can dramatically improve sample efficiency when learning a highquality pixelstoactions policy; in the popular benchmark task, Breakout, HyPE learns an order of magnitude faster than common baseline reinforcement learning and evolutionary strategies for policy learning.
capbtabboxtable[][\FBwidth] \captionsetupbelowskip=0pt
HypothesisDriven Skill Discovery for
Hierarchical Deep Reinforcement Learning
Caleb Chuck Department of Computer Science The University of Texas at Austin Austin, TX 78705 [email protected] Supawit Chockchowwat Department of Computer Science The University of Texas at Austin Austin, TX 78705 [email protected] Scott Niekum Department of Computer Science The University of Texas at Austin Austin, TX 78705 [email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
While recent advancements in deep reinforcement learning (RL) have been used to obtain exciting results on a variety of highdimensional visual tasks, ranging from video games to robotic manipulation, these algorithms often require large amounts of data in order to achieve good performance. In many realworld tasks, such as robotics, this data is difficult and expensive to collect in large quantities. Though one of many factors, one cause of this high sample complexity is that deep RL algorithms typically strive to be as problemagnostic as possible. Generality, while a laudable goal, often ignores powerful inductive biases—some of which may work well across large collections of problems of interest. In this work, we focus on objectcentric tasks that (approximately) obey a set of commonsense physical laws; for example, we assume that objects have a consistent visual appearance and have properties (for instance, position or velocity) that do not change unless acted upon by another object. Many physical tasks admit an object based latent space with some consistent visual appearances, in which case factorizing the state using these assumptions can greatly reduce state space complexity. Furthermore, though there is a combinatorially large number of possible arbitrary hypotheses, many physical systems only have a small number of interesting circumstances under which object changes will occur (i.e., contant when push an object, or using a key to open a door). Thus, our assumptions direct exploration to target these states over any arbitrary unseen state. In exchange for this limiting set of assumptions, we will demonstrate that it is possible to realize dramatic gains in sample efficiency for problems in which these assumptions hold. Furthermore, we argue that this set of assumptions is quite reasonable for a wide range of tasks, ranging from pseudophysical video games to robotic manipulation.
To leverage the problem structure provided by objects in physical reinforcement learning tasks, we introduce the Hypothesis Proposal and Evaluation (HyPE) algorithm. As an example of the type of learning process we wish to capture, consider a human learning to play the Atari game Breakout, in which the player must control a paddle to bounce a ball, which in turn, can destroy bricks at the top of the screen. By experimenting with random inputs, the player observes the ball and paddle moving, quickly identifying them as objects that obey pseudophysical laws. However, only the paddle motion appears to be directly correlated with the controller inputs, so the player attempts to learn to control it first. As they do, they observe that the ball bounces off the paddle and destroys the bricks, which provides reward. Recognizing that they cannot affect the bricks through the paddle directly, the player learns to control the ball via bouncing with the paddle. Finally, once the player has learned to control the ball, they can learn strategies to aim and destroy bricks quickly, completing the game.
HyPE formalizes the intuitions behind such a learning process in a 3step learning loop:

1.
State abstraction via object discovery: The first step of the HyPE loop discovers objects from raw pixel inputs by learning convolutional filters that meet certain physicsguided criteria. Reasoning about objects provides a factorization of state that can circumvent the need to experience dense samples from the full distribution of possible states. Instead, HyPE learns to control simpler objectobject interactions that each only rely on a small subset of state information. For example, changing the directional velocity of the ball with the paddle is only dependent upon the paddle and ball positions, and not the wall or bricks.

2.
Skill proposal via hypothesis generation: The second step of the HyPE loop proposes hypotheses about what caused the changes it observed in object properties. Namely, it hypothesizes about whether the change is controllable and, if so, which objectobject interactions can control it. HyPE then converts each hypothesis into a goal for a corresponding skill/option sutton , such that the control hypothesis becomes testable. For example, hypothesizing that the paddle can be used to change the directional velocity of the ball produces a goal of bouncing the ball.

3.
Hierarchical skill learning via hypothesis evaluation: The third step of the HyPE loop uses RL to try to learn the options proposed in the previous step, thus testing the generated hypotheses. Successful learning of this option confirms the interaction hypothesis. However, for learning to be tractable, a given option often uses other previously learned options as primitive actions, leading to a hierarchy of skills. For example, an option for destroying a brick can use the ballcontrol options as primitive actions, which in turn use paddlecontrol options as primitive actions, which in turn use raw controller inputs.
We demonstrate that HyPE improves the sample efficiency of policy learning on the classic arcade game of Breakout from raw pixels by an order of magnitude, as compared to baseline deep RL algorithms. In addition, we show that HyPE automatically constructs a control structure which describes and characterizes several intuitive components of the game, providing evidence that HyPE contains the right set of inductive biases to serve as a foundation for scaling RL to realworld manipulation tasks.
2 Assumptions
We frame our problem as a Markov Decision Process (MDP). At each time step $t$, the agent observes state $x\in X$, with starting state distribution ${X}_{0}$, and takes action $a\in {A}_{\text{raw}}$. The next state is determined by $T(x,a,{x}^{\prime})$, which is the probability of experiencing subsequent state ${x}^{\prime}$ given current state $x$ and current action $a$. A policy $\pi (x,a)$ is defined as $P(ax)$, or the probability of an action $a$ given state $x$. The agent receives external reward as a function of the current state and action $R(x,a)\to \mathbb{R}$. From this, we can define the return as the discounted future reward: ${r}_{t}={\sum}_{i=t}^{T}{\gamma}^{it}R({x}_{i},{a}_{i})$, where $\gamma $ is a given discount factor. Reinforcement learning problem tries to find the policy that maximizes this total expected return.
To solve this MDP efficiently, our proposed algorithm generates and tests hypotheses about relationships between actions and objects in the world based on several assumptions. These assumptions guide where interesting events will occur (locations), when to look for such events (salient times), and how to check for controllability (option policy learning).
In our instantiation of HyPE, we define ${f}^{({o}_{i})}$ to be a function that finds the location of object ${o}_{i}$ in an input image, represented as ${x}_{{o}_{i}}^{(t)}$—the x,y coordinates of that object (time indexed by $t$). $\mathrm{\Delta}{x}_{{o}_{i}}$ represents a change in position. In general, ${f}^{({o}_{i})}$ and ${x}_{{o}_{i}}^{(t)}$ can encompass a broader set of object properties, rather than only position. The following subsections outline additional assumptions made by our instantiation of HyPE, but note that many of these assumptions can similarly be generalized beyond properties such as position.
2.1 Object Structure
Our first assumption is that the world is comprised of objects with consistent properties and relationships. We make this assumption not only to provide a basis for factorizing the state space, but also to factorize the action space. We use ${x}_{\text{raw}}$ to denote the input state, an image.
Consistent Visual Properties: We assume there exists a function ${f}^{{o}_{i}}({x}_{\text{raw}})A\to {x}_{{o}_{i}}$, which maps from the raw state ${x}_{\text{raw}}$, to the location ${x}_{{o}_{i}}$ of object ${o}_{i}$. This implies that visual cues can be used to determine the location of an object in every state. A
Object Control Relationships: We say that there is a relationship between two objects ${o}_{i},{o}_{j}$ If one object ${o}_{i}$ can be used to control another object ${o}_{j}$. This relationship is characterized by at least one control $\mathrm{\Delta}{x}_{{o}_{j}}$, and corresponding policy with an action space ${A}_{{o}_{i}}$ defined over ${o}_{i}$ (manipulation of the first object), which produces $\mathrm{\Delta}{x}_{{o}_{j}}$. This should define at least two behaviors: the natural behavior in ${o}_{j}$ when not controlled by ${o}_{i}$, and the control policy behavior caused by ${o}_{i}$. This is formalized by contrasting the control policy ${\pi}_{{A}_{{o}_{i}}}^{(\mathrm{\Delta}{x}_{{o}_{j}})}({x}_{\text{raw}},a)$ with a random policy ${\pi}_{\text{random}}({x}_{\text{raw}},a)$, which has an ${\u03f5}_{\pi}$ lower probability of producing the control behavior. In other words, the control policy has the property (with time horizon $T$):
$$P({x}_{{o}_{j}}^{(t+T)}{x}_{{o}_{j}}^{(t)}=\mathrm{\Delta}{x}_{{o}_{j}}{\pi}_{{A}_{{o}_{i}}}^{(\mathrm{\Delta}{x}_{{o}_{j}})}({x}_{\text{raw}},a))P({x}_{{o}_{j}}^{(t+T)}{x}_{{o}_{j}}^{(t)}=\mathrm{\Delta}{x}_{{o}_{j}}{\pi}_{\text{random}}({x}_{\text{raw}},a))\ge {\u03f5}_{\pi}$$  (1) 
The set of possible $\mathrm{\Delta}{x}_{{o}_{j}}$ define ways to manipulate ${x}_{{o}_{j}}$ within a set time horizon. If ${\pi}_{{A}_{{o}_{i}}}^{(\mathrm{\Delta}{x}_{{o}_{j}})}$ is learned for at least one $\mathrm{\Delta}{x}_{{o}_{j}}$, we can use these $\mathrm{\Delta}{x}_{{o}_{j}}$ an action space ${A}_{{x}_{{o}_{j}}}$ on object ${o}_{j}$. In HyPE, ${o}_{i}$ and ${o}_{j}$ are only related if changes in ${o}_{j}$ can be induced by ${A}_{{x}_{{o}_{i}}}$. Since ${x}_{{o}_{j}}$ is a position, $\mathrm{\Delta}{x}_{{o}_{j}}$ is a displacement.
We can also treat the actions taken on the base MDP ${A}_{\text{raw}}$ and the extrinsically defined environment reward $R(x,a)$ (the score in video games) as special types of objects. The state, ${x}_{{o}_{i}}$ of these objects is the instantiation taken at time $t$ (action taken and reward received respectively). These are “abstract objects”, since they are not observable in ${x}_{\text{raw}}$.
2.2 Proximity
We will make use of a saliency $S:{\{{x}_{{o}_{i}}^{(t)}\times {x}_{{o}_{j}}^{(t)}\}}_{t\in 0,\mathrm{\dots}T}\to {\{0,1\}}^{T}$, which determines timesteps where one object is likely to effect change in another object. While in general, any two objects might interact at any time, we use spatial proximity and the quasistatic property (defined in the next section) to limit the search for object interactions. This narrowing of the search space of object interactions makes search more tractable, and limits the number of spurious relationships that could confuse a learner.
Spatial proximity uses ${x}_{{o}_{i}}$ (the location of object ${o}_{i}$) from Section 2.1. For objects ${o}_{i}$ and ${o}_{j}$, we can define the proximity of these objects to be if ${\parallel {x}_{{o}_{i}}{x}_{{o}_{j}}\parallel}_{2}\le \u03f5$, that is, the l2norm is less than some epsilon.
Note that we implicitly assume temporal proximity in the defining saliency, since we say that certain timesteps are salient. This implies that object effects (changes) and salient events cooccur within a short time window.
2.3 Quasistatic Property
We additionally assume that some properties of objects are quasistatic—they do not change unless acted upon by another object. By making this assumption, we gain a clear indicator of when an interaction has occurred, and use that information to find an explanation for this change. This property is also physically supported: most objects do not spontaneously exhibit changes, especially in physical scenarios. Furthermore, even if such spontaneous changes occur (i.e. a chemical reaction), this property can be used to reason about the untracked variables that induced that change.
To define changepoints, we use the formulation from Changepoint Detection using Approximate Model Parameters (CHAMP) niekum . This formulation finds $m$ changepoints in a trajectory, $\{{c}_{1},\mathrm{\dots},{c}_{m}\}$, where changepoints are determined such that within each segment ${\{{x}_{{o}_{i}}^{(t)}\}}_{t\in \{{c}_{k}\mathrm{\dots}{c}_{k+1}\}}$, a model $M$ has the property $M({x}_{{o}_{i}}^{(t)})\approx {x}_{{o}_{i}}^{(t+1)}$, along with regularization of and constraints on the number of models. In other words, in segments between changepoints, a simple model can roughly predict the next state. Our model choice is: ${M}_{d}({x}_{{o}_{i}})={x}_{{o}_{i}}+d$, a fixed vector $d$displacement model.
In this work, we use the quasistatic property in two ways. First, the quasistatic property assumes that changepoints are caused by an objectobject interactions. Second, we use the quasistatic property to assert that object changepoints are salient times to search for changes in other objects (for example, flipping a switch might be a good time to check if anything in the environment changes).
2.4 Contingency
While there may be many objects in the world, we can limit the ones we are interested in by searching only for objects that are contingent, that is, objects we can control. By doing this, we avoid the problem of learning the full forward dynamics of a system, which might be very difficult, especially in a model free setting. At the same time, we still learn about all the objects the agent is able to effect in the environment.
We define contingent objects recursively: a contingent object can either be controlled directly by the actions available to the agent, ${A}_{\text{raw}}$, or through control of a different contingent object. For example, the ball in Breakout is contingent because it can be controlled through the paddle (which is controlled directly by ${A}_{\text{raw}}$), but the walls are not because they cannot be controlled by any contingent object.
To codify these object interactions, we use a graph $\mathcal{G}$. A node ${N}_{i}$ of $\mathcal{G}$ corresponds to an object ${o}_{i}$. A directed edge between a node only exists if a control policy as defined in Equation 1 has been learned. An object ${o}_{j}$ is contingent if there exists a path from the abstract node corresponding to the raw actions (call this ${\mathcal{A}}_{\text{raw}}$), and ${N}_{j}$. HyPE only attempts to learn edges from contingent nodes. The graph structure for Breakout is shown in Figure 2.
2.5 Optionbased Causal Discovery
Finally, in HyPE we assume that if we can learn at least one ${\pi}_{{A}_{{o}_{i}}}^{(\mathrm{\Delta}{x}_{{o}_{j}})}({x}_{\text{raw}},a)$ from a contingent node ${o}_{i}$, then object ${o}_{i}$ and object ${o}_{j}$ has a causal relationship where ${o}_{i}$ causes $\mathrm{\Delta}{x}_{{o}_{j}}$ (adding an edge between ${N}_{i}$ and ${N}_{j}$ in the graph defined in Section 2.4). This is because at least one ${\pi}_{{A}_{{o}_{i}}}^{(\mathrm{\Delta}{x}_{{o}_{j}})}({x}_{\text{raw}},a)$ implies there are at least two ways to control ${o}_{j}$: $\mathrm{\Delta}{x}_{{o}_{j}}$ with a probability higher than ${\u03f5}_{\pi}$, using $p{i}_{{A}_{{o}_{i}}}^{(\mathrm{\Delta}{x}_{{o}_{j}})}$, and causing $\mathrm{\Delta}{x}_{{o}_{j}}$ with a probability higher than $\u03f5$ with a probability lower than ${\u03f5}_{\pi}$ using ${\pi}_{random}$. Recall that ${\pi}_{{A}_{{o}_{i}}}^{(\mathrm{\Delta}{x}_{{o}_{j}})}({x}_{\text{raw}},a)$ produces $\mathrm{\Delta}{x}_{{o}_{j}}$ with ${\u03f5}_{p}i$ higher probability than a baseline policy (typically the random policy). This means that through displacements of ${o}_{i}$, this learned policy can do nontrivial control of ${o}_{j}$, which define an action space over the target object ${A}_{{o}_{j}}$.
3 Proposed Algorithm
Our system involves a threestep hypothesis proposal and evaluation loop (HyPE loop). We define a hypothesis as a Boolean statement about a relationship between two objects ${o}_{i},{o}_{j}$. At each full iteration of the loop, the system adds edges to the graph $\mathcal{G}$ as described in Section 4. The HyPE loop’s three steps are:

1.
Object discovery, which locates a new object ${o}_{j}$ by learning a new ${f}^{{o}_{j}}({x}_{\text{raw}})$.

2.
Hypothesis proposal, which uses past data (possibly collected using prior iterations of HyPE) to define a hypothesis $H$ about where and how two objects interact.

3.
Hypothesis verification, where the agent learns to reliably reproduce the interaction between two objects.
Over multiple iterations of the HyPE loop, an object interaction graph $\mathcal{G}$ as defined in section 2.4 is generated, one edge at a time. We start with a single node, ${\mathcal{A}}_{\text{raw}}$, and use it to discover filters to locate new objects in the frame, and learn how those objects can be controlled.
3.1 Object Discovery
This step of the algorithm seeks to visually identify objects that are controllable, either directly or via another object. This corresponds to searching for visual features whose movements can be explained by either direct action inputs or interactions with contingent objects. We make this into an optimization problem, in which we try to learn ${f}^{{o}_{j}}({x}_{\text{raw}})$, a filter on the input image that identifies the location of an object that meets the aforementioned controllability criteria. We structure this section by describing the following:

1.
Using the quasistatic property to find changes in a new object

2.
Incorporating saliency with another object can be used to explain these changes.

3.
Combine these into a loss term

4.
Add smoothness to define the full loss function

5.
Optimize the defined loss term
Visual Changepoints: The function ${f}^{{o}_{j}}:{x}_{\text{raw}}\to {\mathbb{R}}^{2}$, represented by a convolutional neural network (CNN), outputs locations ${x}_{{o}_{j}}^{(t)}$ at every timestep (initially, these locations will probably be random, since the initial network is random, but given successful training, will track one of the objects). Taking this time series sequence of locations, we apply CHAMP to get a sequence of changepoints ${\tau}_{1},\mathrm{\dots}{\tau}_{m}$. We define $C({x}_{{o}_{j}}^{(t)})):\mathbb{R}{}^{2}\to \{0,1\}$ to be an indicator of whether there is a changepoint at time $t$ in the sequence ${\{{x}_{{o}_{j}}\}}_{t\in [1,\mathrm{\dots},T]}$. From the quasistatic property, we expect that some kind of object interactions produced these changepoints.
Explaning Changepoints: Given the changepoints of ${o}_{j}$, we search for other objects which might have caused some of these changepoints. We narrow our search by restricting possible object interactions to salient regions. If ${o}_{i}$ to interacts with ${o}_{j}$ at time $t$ we expect to see changepoints $C({x}_{{o}_{j}}^{(t)}))=1$ at a salient point. Define $S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)}):{\mathbb{R}}^{2}\times {\mathbb{R}}^{2}\to \{0,1\}$ to be an indicator if a point is salient. If we want all changepoints to be explained by interaction with object ${o}_{j}$, this entails $C({x}_{{o}_{j}}^{(t)}))\iff S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})$ for all $t$.
Changepoint Loss: However, the above comparison assumes that all changepoints are explained by salient points, and all salient points are accompanied by changepoints. This is not always (or often) the case. For example, the ball might bounce off of objects other than the paddle (a changepoint without being salient), and the ball might come close to the paddle without interacting with it (a salient point that is not a changepoint). To account for this, we search for statistically significant amounts of interaction from our historical data (data collected by prior iterations of the HyPE loop). We take the number of changepoints that occur when salient $N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)}))$, and compare with the number of changepoints $N(C({x}_{{o}_{j}}^{(t)}))$ and the number of salient points $N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)}))$.
To search for statistical significance, we use the F1 score $F1({\text{\mathbf{x}}}_{{o}_{i}},{\text{\mathbf{x}}}_{{o}_{j}})$ (defined in the appendix) between these three terms. The F1 score measures the statistical significance of events being correlated. Since the F1 score is computed over counts on the whole dataset (length $n$), we use ${\text{\mathbf{x}}}_{o}$ to denote the full dataset of object positions. Notice that we want the number of explained changepoints ($N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)}))$) to be significant relative to the total number of changepiont $N(C({x}_{{o}_{j}}^{(t)}))$, so that the learned filter does not simply make every frame a changepoint to optimize the explained changepoints. Similarly, we do not want to learn another feature which tracks ${o}_{j}$, so we also want the number of explained changepoints to be significant relative to the number of salient points $N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)}))$. The F1 score gives an an optimizable value which balances these objectives.
Loss function: We add an additional regularization to prevent long jumps, which are not physical. $\parallel {f}^{{o}_{j}}({x}_{\text{raw}}^{(t+1)}){f}^{{o}_{j}}({x}_{\text{raw}}^{(t)})\parallel $. This regularizes displacement of ${x}_{{o}_{j}}$ (controlled by ${\lambda}_{1}$). This gives our full objective for object discovery:
$$\begin{array}{cc}\hfill L({\text{\mathbf{x}}}_{{o}_{i}};{\text{\mathbf{x}}}_{{o}_{j}})=& {F}_{1}({\text{\mathbf{x}}}_{{o}_{i}};{f}^{{o}_{j}}({\text{\mathbf{x}}}_{\text{raw}})){\lambda}_{1}\sum _{t=0}^{n1}\parallel {f}^{{o}_{j}}({x}_{\text{raw}}^{(t+1)}){f}^{{o}_{j}}({x}_{\text{raw}}^{(t)})\parallel \hfill \end{array}$$  (2) 
Optimization: We optimize this objective with covariance matrix adaptation evolution strategy (CMAES, hansen ). In addition, we smooth the outputs to clean up relatively rare cases where the filter fails, an operation has been shown to improve performance on downstream deep learning policy tasks chuck . We represent ${f}_{\theta}^{{o}_{j}}$ with a two layer convolutional network with $3\times 3$ and $5\times 5$ filters.
3.2 Hypothesis Proposal
We generate hypotheses about whether a causal relationship exists between two objects. In section 2.1, we describe object relations as being control of one object through actions defined over another. Thus, to propose a hypothesis, we need to first check if the proposed hypothesis has reasonable evidence from historical data, and then generate a function which indicates when a control produced by another object occurs. In this section, we will describe a means of generating two different kinds of control indicators: a changepoint control, and a displacement control.
Changepoint Control: Before generating a changepoint indicator, we need to determine whether a (controlled) changppoint occurs with statistical significance. This mirrors the F1 criteria defined in the last section, since the objective is closely related: we want to see if our premise affects our target often in data. However, since we are not trying to optimize the inputs, the HyPE hypothesis proposal step first tests:
$$\chi (C({\text{\mathbf{x}}}_{{o}_{j}}),S({\text{\mathbf{x}}}_{{o}_{i}},{\text{\mathbf{x}}}_{{o}_{j}}))=N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)}))\ge \eta N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge \mathrm{\neg}C({x}_{{o}_{j}}^{(t)}))$$  (3) 
This indicates if a changepoint occurs at a salient time ($S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)})$) $\eta $ times more often than not ($S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge \mathrm{\neg}C({x}_{{o}_{j}}^{(t)})$). If this test fails, i.e., in Breakout when checking whether ball changepoints cooccur with actions, then it does not make sense to propose a hypothesis. That does not mean that a hypothesis does not exist—there might be insufficient data, but it does suggest searching for other, more likely object interactions. On the other hand, if the test passes this suggests that one object can be controlled by interaction with another, i.e., the ball can be controlled by bouncing with the paddle. Changepoint indicators take the form:
$$H({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})=S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)})$$  (4) 
Which is a single state version of the Equation 3: observing if a changepoint $C({x}_{{o}_{j}}^{(t)})$ occurs at a salient time $S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})$. In the case of the ball and paddle, it would return 1 (true) when the ball bounces off the paddle, and 0 (false) in other circumstances.
Displacement Control: However, we might want to do more than just induce an arbitrary changepoint. If supported by data, learning options which induce a particular control, in the form of a displacement, will give more options and varied control. For example, rather than simply training the paddle to change any direction, training control would give options that correspond to right, left or stationary movements.
Recall $\mathrm{\Delta}{x}_{{o}_{j}}=d$ is a displacement of the target object, and ${M}_{d}(\cdot )$ is a fixed $d$displacement model (see Section 2.3) for a segment which follows some changepoint ${c}_{k}$ (a timestep). Then, the control hypothesis checks if the segment after a salient changepoint $S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)})$ (as defined in the last section), is fit to a particular, desired $d$displacement model:
$${H}_{d}({x}_{{o}_{i}}^{(t,t+1)},{x}_{{o}_{j}}^{(t,t+1)})=\parallel {x}_{{o}_{j}}^{(t+1)}{x}_{{o}_{j}}^{(t)}d\parallel \le \u03f5\wedge S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})$$  (5) 
In the ball example, this would mean that after the ball bounces off the paddle, a control indicator will indicate true only if the angle of the ball bounce is equal to the desired ball bounce.
In order to decide which displacements to propose hypotheses for (given the check defined in Equation 3 has already passed), the HyPE algorithm takes the segment models ${M}_{d}$ following salient changepoints, and clusters them according to the $d$ parameter using Dirichlet Process Gaussian Mixture Models (clustering is used to denoise). If the cluster corresponding to ${d}_{k}$ has a minimum number of assignments, that control hypothesis is proposed.
3.3 Hypothesis Evaluation
Now, we want to test whether the hypothesis about causal control between two objects is true. This is done by simply incorporating the boolean control indicators as reward functions for different options, output 1 if the hypothesis is true and 0 if false:
$$R({s}^{(t)})=\{\begin{array}{cc}1\hfill & H({x}_{{o}_{i}}^{(t,t+1)},{x}_{{o}_{j}}^{(t,t+1)})\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}$$  (6) 
We define our actions as the learned options, ${A}_{{o}_{i}}$, over the premise object ${o}_{i}$ (not the raw actions ${A}_{\text{raw}}$). This allows us to gain sample efficiency through action abstraction and hierarchy.
To gain sample efficiency through state abstraction, we use as input the locations of the premise and target objects. Though this assumption can be limiting (if there is some other, unknown object that affects the target, the policy does not account for this object), we use this abstraction because the performance benefits appear to outweigh this cost. Even with other objects that are unaccounted for, we still expect to induce a change more often than random, allowing future iterations of HyPE to learn about the unknown object(s).
In order to maximize the advantage of our state abstraction, we add several simple transformations of the inputs to our state space, namely velocity and relative position of the objects: ${x}_{{o}_{j}}^{(t+1)}{x}_{{o}_{j}}^{(t)}$ and ${x}_{{o}_{i}}^{(t)}{x}_{{o}_{j}}^{(t)}$. Our neural net architecture computes the following:
$$\text{softmax}\left({W}_{\text{feedforward}}\frac{1}{k}\sum _{i=0}^{k}\sigma ({W}_{\text{vector}}{\varphi}_{k}({x}_{{o}_{i}},{x}_{{o}_{j}}))\right)$$  (7) 
Where $\varphi (\cdot ,\cdot )$ are our input operations. This architecture converts each input state into a length 128 vector, where ${W}_{vector}$ is a $128\times 2$ matrix of weights (all input properties have dimension 2), and $\sigma $ is some activation. It then takes the mean of all feature vectors, and feeds these forward to the outputs.
Algorithm  Base  HyPE  Rainbow  A2C & PPO 

Timesteps  52,000  55,500  $\sim 1,000,000$  $>1,500,000$ 
4 Results
We demonstrate that HyPE can learn to achieve high performance on the classic game Breakout after two iterations of the algorithm loop. Both the visual and behavior components have intuitive interpretations, so we can observe the performance of the HyPE system as it progresses. As a step by step progression: since the HyPE loop has no initial information, it learns from 1k frames of data with a random policy, and then uses that information with the loss defined in Equation 2, and a prior object of ${o}_{i}={\mathcal{A}}_{\text{raw}}$, which is the only node in the graph so far. Since the vision loss is above the threshold F1 score, $\u03f5$, it checks and proposes a hypothesis, which results in three characteristic behaviors as defined in Equation 5 after clustering^{1}^{1} 1 The HyPE loop automatically chooses a control hypothesis when it has control clusters with more than 10 assignments. Having finished step 2, the HyPE loop performs the learning of three hypothesisgenerated options—moving the paddle $\pm 2$ or $0$ pixels, respectively. These options use paddle position and velocity as input, and learning converges in 2.5k timesteps. Though this first step of the HyPE loop learns an intuitive first object (the paddle), this is not encoded explicitly anywhere in the algorithm, but emerges from physical priors and controllability. Because learning succeeds, the HyPE loop adds a new node, connected to ${\mathcal{A}}_{\text{raw}}$, which we call ${N}_{\text{paddle}}$.
Using the cumulative data, the HyPE loop then applies the vision loss to all nodes. ${\mathcal{A}}_{\text{raw}}$, with the paddle mean removed from the image, and does not meet the threshold to propose any new objects. However, ${N}_{\text{paddle}}$ discovers a new object—the ball (the results of the vision step are shown in Figure 3). In this case, the ball discovery is a consequence of the changepoints it exhibits near the paddle. The subsequent hypothesis check for changepoints near the paddle passes and is used to generate a reward function for learning an option to bounce the ball off of the paddle. Using this learned option continuously is sufficient to achieve high extrinsic reward in a small number of frames.
In Figure 4, We show that HyPE has roughly an order of magnitude improvement in sample efficiency compared to baseline RL methods. HyPE, at train time, achieves average reward per episode of $17.5$ in 55k frames, while Rainbow hessel takes 400k timesteps, and Proximal Policy Optimization schulman and A2C mnih take roughly 1.4M timesteps to achieve the same performance. However, CMAES, used to learn the HyPE bouncing policy, typically has higher test performance than train. Thus, we also show that the evaluation policy learned by HyPE after 55000 frames achieves $244$ average reward per episode performance, which is more than an order of magnitude better than Rainbow, the best performing baseline.
Finally, in order to better understand the performance gains of HyPE, we demonstrate that a learner using the actual positions of the ball and the paddle (and relative positions) achieves performance similar to that learned by HyPE, as shown by the “Base” in Figure 4 and Table 1. This implies that the majority of the performance improvement comes from using the object relative input states. However, HyPE provides a principled method for learning and basis for using those objectrelative input states, as well as the ability to perform targeted, hierarchical exploration.
5 Related Work
This work amalgamates ideas from several different subdirections of reinforcement learning. In particular, it incorporates concepts from causal graphs, relational and objectoriented reinforcement learning, objectscene modeling literature, model based reinforcement learning, exploration methods in reinforcement learning and optionbased or hierarchical reinforcement learning.
Causal graphs: The control graph structure and hypothesis verification components of HyPE draw upon ideas related to causality pearl . These components of HyPE relate to where graphs Shanmugam and graphdependent policies Buesing ; wangnerve are learned from interactions with the environment. Graph network architectures provide a differentiable structure by which to learn causal object relationships battaglia ; garnelo , and have been used to pick up object relationships toyer for forward modeling through object oriented and relational reinforcement learning diuk ; zambaldi . Objectoriented and relational reinforcement learning often use preconditions and postconditions to determine object relationships, and have been used for forward modeling watters ,curriculum generation silva and transfer marom . Alternatively, HyPE uses a similar objectoriented network structure for highly efficient hierarchical reinforcement learning and visual factorization.
Schema networks Kansky , combine intuitive physics with a process for learning causal networks for gaining sample efficiency on model transfer. These networks achieve results related to object interactions and useful control over them which parallels the learned objectinteraction graph from the HyPE loop. However, Schema Networks do not learn objects from raw inputs or provide a curriculum of learning over different objects, preventing them from having the same sampleefficiency benefits on nontransfer problems as HyPE. To our knowledge, HyPE represents a unique method for sequentially generating a causal graph structure from the ground up.
Scene Modeling: Decomposing a scene into elements, and modeling the forward dynamics of those elements is a classic problem. HyPE approaches this problem in an unsupervised manner. Several works have shown promising results in using neural networks in the form of variational auto encoders higgins , or expectation maximization van2018relational to decompose scenes to latent spaces, sometimes factorized burgess . In addition to simply decoming the scene, some methods strive to learn a forward model, in the raw pixel space chiappa2017recurrent , the latent space hafner2018learning , using pairwise interactions chang2016compositional , graph networks battaglia2016interaction or Bayesian inference eslami2016attend ; kosiorek2018sequential . Some methods also incorporate physical properties, such as energy conservation sienko2002hamiltonian , force mottaghi2016happens or intuitive physics agrawal2016learning . Broad physical assumptions have been incorporated into work in probing intuitive dynamics Piloto , learning physical dynamics or relations Chang ; zambaldi , object representations Greff , and physics belbute .
However, while these methods generally perform complete scene modeling, HyPE represents a guided means of targeting modeling towards specific contingent regions, and learning modeling in a factored and sequential manner, both of which reduce the problem complexity drastically. In addition, HyPE avoids more scene complexity by learning policies to produce specific control without requiring complex modeling of pixel forward dynamics.
Exploration: This work incorporates ideas from exploration in reinforcement learning literature by designing a system for generating intrinsic rewards which guide an agent towards interesting. Early exploration work involves trying to uniformly explore actions by epsilon greedy policies sutton2018reinforcement , state liu2018learning , or policy space eysenbach2018diversity , sometimes incorporating entropy in the optimization mnih . Similar to state uniformity, other exploration work exploits novelty by learning hash functions to search for novel states ostrovski ; Tang ; burda , or replicating actions to return to partially explored regions savinov ; ecoffet . Alternatively “curiosity” or “surprise” is used, by training an internal model to predict state, and rewarding reching states which high predictive error burda2018large ; pathak2017curiosity. The HyPE algorithm differs from these exploration schemes in that it is a directed exploration that uses physical priors to target object interactions, rather than novelty or curiosity. Our method does guide towards changepoints, which bears some similarity to curiosity based methods.
Other methods might also direct exploration towards certain values, for example, by trying to explore towards subgoals from hindsight andrychowicz , using a learned controller to direct exploration vezhnevets , bottleneck regions bacon or contingency bellemare ; choi2019contingency . The HyPE algorithm is most similar to the last group, in that we direct exploration towards a particular kind of interesting state: controllable changepoints.
Hierarchical RL: Learning hierarchies of control with options has been studied in detail sutton ; bacon , and can be used to define a system for learning skills and state spaces konidaris ; kulkarni2016hierarchical ; barreto2017successor ; mankowitz . Other methods use a hierarchy involving a high level controller which gives target states, and a lowlevel controller that navigates to them vezhnevets2017feudal ; arulkumaran2016classifying . While the HyPE loop also builds a hierarchy of interations, it uniquely exploits the combination of broad physical assumptions and hierarchical reinforcement learning to achieve state and action abstraction.
Modelbased RL: Model based reinforcement learning using a learned model has recent shown substantial improvements in reinforcement learning sample efficiency in Atari games kaiser . These methods often learn to predict future raw state silver2017predictron , or latent space watter2015embed in tandem to learning a useful policy schmidhuber2015learning . They then incorporate planning, possibly in the latent space piergiovanni2018learning to reach their goal lowrey2018plan . The agent can also be traied to trade off reliance on the environment model racaniere2017imagination ; nagabandi2018neural , or use multiple models rajeswaran2016epopt for robustness. HyPE makes loose use of modeling to generate different options, but once it learns control policies, it applies model free reinforcement learning. As a result, HyPE should be able to take advantage of model based RL to improve sample effiency and accuracy further.
6 Conclusion
We introduced the HyPE algorithm, which incorporates general purpose priors about the world, such as proximity, object factorization, and quasistatic assumptions, in order to efficiently learn to hierarchically explore and control its environment. Though this system requires several limiting assumptions, making it less application agnostic than classic generalpurpose RL algorithms, these assumptions are reasonable for physical domains and lead to sample efficiency that is roughly an order of magnitude better than baseline RL methods. Future work can aim to address the practical issues required to extend HyPE to successfully work in physical realworld domains, such as robotic manipulation. Furthermore, the causal graph structure generated by HyPE may have implications for both explainable AI as well as transfer learning that can be explored.
Acknowledgments
The authors are supported by the ONR through the National Defense Science And Engineering Graduate Fellowship (NDSEG) program. This work has taken place in the Personal Autonomous Robotics Lab (PeARL) at The University of Texas at Austin. PeARL research is supported in part by the NSF (IIS1724157, IIS1638107, IIS1617639, IIS1749204) and ONR(N00014182243).
References
 (1) Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016.
 (2) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
 (3) Kai Arulkumaran, Nat Dilokthanakul, Murray Shanahan, and Anil Anthony Bharath. Classifying options for deep reinforcement learning. arXiv preprint arXiv:1604.08153, 2016.
 (4) PierreLuc Bacon, Jean Harb, and Doina Precup. The optioncritic architecture. In ThirtyFirst AAAI Conference on Artificial Intelligence, 2017.
 (5) André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065, 2017.
 (6) Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016.
 (7) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro SanchezGonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
 (8) Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using atari 2600 games. In TwentySixth AAAI Conference on Artificial Intelligence, 2012.
 (9) Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, JeanBaptiste Lespiau, and Nicolas Heess. Woulda, coulda, shoulda: Counterfactuallyguided policy search. arXiv preprint arXiv:1811.06272, 2018.
 (10) Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Largescale study of curiositydriven learning. arXiv preprint arXiv:1808.04355, 2018.
 (11) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
 (12) Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.
 (13) Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional objectbased approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.
 (14) Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional objectbased approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.
 (15) Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
 (16) Jongwook Choi, Yijie Guo, Marcin Lukasz Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingencyaware exploration in reinforcement learning. 2019.
 (17) Caleb Chuck, Michael Laskey, Sanjay Krishnan, Ruta Joshi, Roy Fox, and Ken Goldberg. Statistical data cleaning for deep learning of automation tasks from demonstrations. In 2017 13th IEEE Conference on Automation Science and Engineering (CASE), pages 1142–1149. IEEE, 2017.
 (18) Filipe de Avila BelbutePeres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. Endtoend differentiable physics for learning and control. In Advances in Neural Information Processing Systems, pages 7178–7189, 2018.
 (19) Carlos Diuk, Andre Cohen, and Michael L Littman. An objectoriented representation for efficient reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 240–247. ACM, 2008.
 (20) Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Goexplore: a new approach for hardexploration problems. arXiv preprint arXiv:1901.10995, 2019.
 (21) SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pages 3225–3233, 2016.
 (22) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 (23) Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.
 (24) Klaus Greff, Raphaël Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multiobject representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450, 2019.
 (25) Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
 (26) Nikolaus Hansen and Andreas Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE international conference on evolutionary computation, pages 312–317. IEEE, 1996.
 (27) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 (28) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. betavae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
 (29) Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Modelbased reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
 (30) Ken Kansky, Tom Silver, David A Mély, Mohamed Eldawy, Miguel LázaroGredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zeroshot transfer with a generative causal model of intuitive physics. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1809–1818. JMLR. org, 2017.
 (31) George Konidaris. Constructing abstraction hierarchies using a skillsymbol loop. In IJCAI: proceedings of the conference, volume 2016, page 1648. NIH Public Access, 2016.
 (32) Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In Advances in Neural Information Processing Systems, pages 8606–8616, 2018.
 (33) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
 (34) Evan Zheran Liu, Ramtin Keramati, Sudarshan Seshadri, Kelvin Guu, Panupong Pasupat, Emma Brunskill, and Percy Liang. Learning abstract models for longhorizon exploration. 2018.
 (35) Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via modelbased control. arXiv preprint arXiv:1811.01848, 2018.
 (36) Timothy A. Mann Mankowitz, Daniel J. and Shie Mannor. Adaptive skills adaptive partitions (asap). In Advances in neural information processing systems, 2016.
 (37) Ofir Marom and Benjamin Rosman. Zeroshot transfer with deictic objectoriented representation in reinforcement learning. In Advances in Neural Information Processing Systems, pages 2291–2299, 2018.
 (38) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 (39) Roozbeh Mottaghi, Mohammad Rastegari, Abhinav Gupta, and Ali Farhadi. “what happens if…” learning to predict the effect of forces in images. In European Conference on Computer Vision, pages 269–285. Springer, 2016.
 (40) Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 (41) Scott Niekum, Sarah Osentoski, Christopher G Atkeson, and Andrew G Barto. Online bayesian changepoint detection for articulated motion models. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1468–1475. IEEE, 2015.
 (42) Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Countbased exploration with neural density models. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 2721–2730. JMLR. org, 2017.
 (43) Judea Pearl. Causality. Cambridge university press, 2009.
 (44) AJ Piergiovanni, Alan Wu, and Michael S Ryoo. Learning realworld robot policies by dreaming. arXiv preprint arXiv:1805.07813, 2018.
 (45) Luis Piloto, Ari Weinstein, Arun Ahuja, Mehdi Mirza, Greg Wayne, David Amos, Chiachun Hung, and Matt Botvinick. Probing physics knowledge using tools from developmental psychology. arXiv preprint arXiv:1804.01128, 2018.
 (46) Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imaginationaugmented agents for deep reinforcement learning. In Advances in neural information processing systems, pages 5690–5701, 2017.
 (47) Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
 (48) Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
 (49) Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.
 (50) Jürgen Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.
 (51) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 (52) Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.
 (53) Wieslaw Sienko, Wieslaw M Citko, and Bogdan M Wilamowski. Hamiltonian neural nets as a universal signal processor. In IEEE 2002 28th Annual Conference of the Industrial Electronics Society. IECON 02, volume 4, pages 3201–3204. IEEE, 2002.
 (54) Felipe Leno Da Silva and Anna Helena Reali Costa. Objectoriented curriculum generation for reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1026–1034. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 (55) David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel DulacArnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: Endtoend learning and planning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3191–3199. JMLR. org, 2017.
 (56) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 (57) Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semimdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(12):181–211, 1999.
 (58) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of countbased exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
 (59) Sam Toyer, Felipe Trevizan, Sylvie Thiébaux, and Lexing Xie. Action schema networks: Generalised policies with deep learning. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 (60) Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.
 (61) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3540–3549. JMLR. org, 2017.
 (62) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3540–3549. JMLR. org, 2017.
 (63) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. Nervenet: Learning structured policy with graph neural networks. 2018.
 (64) Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.
 (65) Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In Advances in neural information processing systems, pages 4539–4547, 2017.
 (66) Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.
7 Breakout
8 Atari 2600 Breakout
While our version of Breakout has all the same intuition as the Atari 2600 version, in this work we do not use the traditional Atari 2600 version of the game. This is because it encroaches on 3 of our guiding assumptions. In particular, consistent properties, consistent relationships and the quasistatic property. In the case of visual structure, we assume that the visual scene is the same as the world state. However, in Atari 2600 objects can become occluded and disappear from the scene. Second, we assume that control is constant in terms of displacement, but this is not true in two ways. First, paddle control has a momentum value such that the first action or change of action produces partial changes to the paddle shape and position. Second, the ball changes velocity and appearance depending on the number of bounces that has been made overall in the episode. This nonstationarity also affects the quasistatic property, where the properties of the ball change not as a result of interaction with an object, but because of an internal counter. While we do not expect that these problems are insurmountable, solving them requires significant extensions to the existing system.
[H] \[email protected]@algorithmic \STATEInput: Raw actions ${A}_{\text{raw}}$, Environment $E$ \STATEInitialize: Dataset $\mathcal{D}=[]$, Graph $\mathcal{G}=\{\}$ \STATEAdd ${\mathcal{A}}_{\text{raw}}$ to $\mathcal{G}$ \STATECollect data from random policy: $\mathcal{D}\leftarrow {\mathcal{D}}_{{\pi}_{\text{Random Repeat}}}$ \REPEAT\STATEChoose node ${N}_{i}$ from $\mathcal{G}$ randomly \STATEGet ${\text{\mathbf{x}}}_{{o}_{i}}=\{{f}_{\theta}^{{o}_{j}}({x}_{\text{raw}}),\forall {x}_{\text{raw}}\in \mathcal{D}\}$ \STATEobject discovery Optimize ${\mathrm{arg}\mathrm{max}}_{\theta}L({x}_{{o}_{i}},{f}_{\theta}^{{o}_{j}}({x}_{\text{raw}})),\forall {x}_{{o}_{i}},{x}_{\text{raw}}\in {\text{\mathbf{x}}}_{{o}_{i}},\mathcal{D}$ (Equation 2) \IF$L({x}_{{o}_{i}},{f}_{\theta}^{{o}_{j}}({x}_{\text{raw}}))>{\tau}_{\text{vision}}$ \STATEHypothesis Proposal \STATEGet ${\text{\mathbf{C}}}_{{o}_{j}}=C({f}_{\theta}^{{o}_{j}}({x}_{\text{raw}}^{(t)})),\forall {x}_{\text{raw}}^{(t)}\in \mathcal{D}$ \STATEGet ${\text{\mathbf{S}}}_{{o}_{i},{o}_{j}}=S({x}_{{o}_{i}}^{(t)},{f}_{\theta}^{{o}_{j}}({x}_{\text{raw}}^{(t)})),\forall {x}_{\text{raw}}^{(t)}\in \mathcal{D},{x}_{{o}_{i}}^{(t)}\in {\text{\mathbf{x}}}_{{o}_{i}}$ \IF$\chi ({\text{\mathbf{C}}}_{{o}_{j}},{\text{\mathbf{S}}}_{{o}_{i},{o}_{j}})$ (Equation 3 \STATEDefine $H({x}_{{o}_{i}}^{(t,t+1)},{x}_{{o}_{j}}^{(t,t+1)})$ (Section 2.2) \STATEDefine all ${R}_{\mathrm{\Delta}{x}_{{o}_{j}}^{{d}_{k}}}({x}_{{o}_{i},{o}_{j}}^{(t,t+1)})$^{2}^{2} 2 if multiple control hypotheses and input state $[{x}_{{o}_{i}}{x}_{{o}_{j}},{\dot{x}}_{{o}_{j}},{x}_{{o}_{i}},{x}_{{o}_{j}}]$ ^{3}^{3} 3 where relevant, that is, only include terms where ${o}_{i}/{o}_{j}$ not abstract \STATEPerform RL to maximize all ${R}_{\mathrm{\Delta}{x}_{{o}_{j}}^{{d}_{k}}}$ \ENDIF\ENDIF\UNTILSufficient performance
9 Object Discovery F1 Score
The ${F}_{1}$ score is a common test for statistical significance, defined as $\frac{2\cdot \text{precision}\cdot \text{recall}}{\text{precision}+\text{recall}}$. Given event A, and its complement $\stackrel{~}{A}$, a classifier $C$ that assigns to $A$, and a counts $N(A),N(\stackrel{~}{A}),N({C}_{A})$, which are the number of true occurrances of $A$, $B$, and classification to $A$, then the precision is $\frac{N({C}_{A})}{N(A)}$, and recall is $\frac{N({C}_{A})}{N(A)+N(B)}$. In our case, event $A$ is $S({x}_{{o}_{i}}^{(t)},{f}_{\theta}^{{o}_{j}}({x}_{\text{raw}}^{(t)}))\wedge C({f}_{\theta}^{{o}_{j}}({x}_{\text{raw}}^{(t)}))$, which is interpreted as a timestep where ${x}_{{o}_{i}},{f}_{\theta}^{{o}_{j}}({x}_{\text{raw}})$ are salient^{4}^{4} 4 In this work, we use proximity for all objectobject interactions (such as the Paddle and Ball), and the quasistatic assumption for abstract objectobject interactions (such as the Actions and Paddle), and there is a changepoint in ${f}_{\theta}^{{o}_{j}}({x}_{\text{raw}})$. Given counting function $N$, and operating on a full dataset of input
$$\text{precision}=\frac{N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)})}{N\left(C({x}_{{o}_{j}}^{(t)})\right)+N\left(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\right)}$$  (8) 
and
$$\text{recall}=\frac{N(S({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})\wedge C({x}_{{o}_{j}}^{(t)})}{N\left(C({x}_{{o}_{j}}^{(t)})\right)}$$  (9) 
This precision and recall defines the desired ${F}_{1}$ metric ${F}_{1}({x}_{{o}_{i}},{f}_{\theta}^{{o}_{j}}({\text{\mathbf{x}}}_{\text{raw}}))$.
10 Object Discovery Smoothing
The optimization yields a candidate object hypothesis ${h}^{(*)}$ which fails on complicated frames due to the limited number of parameters. We correct the model by training a larger CNN to match the target heatmaps $z({h}^{(*)}({x}^{(t)}))$ generated by inserting a lowvariance normal curve centered at the coordinate ${h}^{(*)}({x}^{(t)})$ of each frame. The last filter is penalized with a lasso regularization resulting in Equation 10. Lastly, the object recognition is composed by $h({x}^{(t)})={\mathrm{arg}\mathrm{max}}_{y\in {A}_{\text{loc}}}{[f(x)]}_{y}$.
$$\begin{array}{c}\hfill {L}_{2}(f;{h}^{(*)})={f({x}^{(t)})z({h}^{(*)}({x}^{(t)}))}^{2}+{\lambda}_{2}{f({x}^{(t)})}_{1}\end{array}$$  (10) 
11 Object Discovery Hyperparameters
The initial population weights are sampled from a normal distribution $N(0,1)$. The hyperparameters are set as followed: the proximity threshold $\u03f5=6$, regularization coefficient ${\lambda}_{1}=20$, a population size of $50$, and run for $40$ epochs. An optimal filter can typically be picked up in this number of epochs, and the algorithm is not particularly sensitive to ${\lambda}_{1}$ or population size (though a bad choice of ${\lambda}_{1}$ will result in a stationary policy. The threshold for having passable F1 score for vision is $0.5$. Values below this threshold can generally be generated by randomly jittering the frame to match some of the desired properties.
12 Saliency Functions
We use as saliency functions: 1) a proximity indicator. That is, the salience function returns true or false for a time step to define a salient time, based on the evaluation:
$$  (11) 
when ${o}_{i},{o}_{j}$ are objects with locations. With abstract objects (such as the raw actions, that are always proximal, or just alternatively), we use a a quasistatic based saliency function that checks if there is a changepoint in the trajectory of ${o}_{i}$. That is, given a set of changepoints ${c}_{1},\mathrm{\dots}{c}_{n}$, the salience is:
$${S}_{C}({x}_{{o}_{i}}^{(t)},{x}_{{o}_{j}}^{(t)})=\{\begin{array}{cc}1\hfill & t\in \{{c}_{1},\mathrm{\dots}{c}_{n}\}\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}$$  (12) 
In terms of which salience to choose, we can simply apply the hypothesis testing or object detection with all salience functions until one sticks (though abstract objects don’t use proximity except as dummy salience). Finally, we can extend either salience function to include a window around salience, which can account for noise and some delayed reaction (i.e. calling the union of sets $\{{c}_{i}w,\mathrm{\dots},{c}_{i}+w\}$ salient). For control hypotheses (hypotheses involving $\mathrm{\Delta}{x}_{{o}_{j}}$, since we assume that object in contact are interacting if proximity has already been shown to produce changepoint interactions, we use proximity for salience to give a dense reward whenever the object has the desired motion, given proximal. This is useful for abstract objects, where we expect the abstract object to control or be controlled directly by some property of the premise object.
In practice, we use a distance of approximately 6 pixels to define proximity, and do not use a window around changepoints.
13 Network Ablations
We tried several other network architectures. In particular, we used a fully connected network, which struggled to interpret the ${x}_{{o}_{i}},{x}_{{o}_{j}}$ inputs, if included, but otherwise performed comparably (though with different, often simpler, output policies). We also tried a transformer style network, which computed keys, queries and values for each of the inputs. However, the size of our network was limited by using CMAES, and performance on comparably sized networks was strictly worse. This network was simplification, where the mean operation allows the network to balance the inputs without fixating on any one of them. We tried different sizes of maps (other than 128), which the network is fairly agnostic to. Even a size $32$ map can learn useful behavior that achieved decent reward ($>100$). The maximum size is around 512, limited by the number of parameters in the network that was viable for CMAES optimization.
14 Argument for CMAES
We tried a variety of policy gradient and deep Qlearning methods to train on some function of the input parameters, without any success, using the true object locations to remove possible noise from the object detection system. This suggests that learning policies on object locations in a reasonable number of frames seems to be a difficult for generic deep RL algorithms, at least for choosing a good set of hyperparameters. We attempted A2C, PPO, Rainbow, DQN, SARSA, SARSA with functional basis, and tabular Q learning. CMAES perform well on this reduced input space, because it requires matrix inversion, an ${n}^{3}$ operations. This capped the parameter number of the network to $11$k parameters, which was probably not sufficient to perform well if learning from raw images.
15 Baseline architecture
Our base architecture for the baseline networks (A2C, PPO) consisted of a 8x8, 32 map filter with stride 4, followed by a 5x5, 64 map filter with stride 2, and a 3x3, 64 map filter of stride 1, followed by a linear layer from the output of the last convolutional layer to size 512. This layer is fully connected to actor and critic components. This architecture has been used in many Atari Deep RL environments, and is the default Atari network for the Google dopamine framework: https://github.com/google/dopamine
16 Learning Parameters for CHAMP and DPGMM
The system is fairly agnostic to the parameters for CHAMP and DPGMM, though a bad choice can produce bad behavior. CHAMP requires 5 parameters, and the displacement model requires an additional parameter. The champ parameters are: a guess at the mean length of a segment, the variance, the minimum segment length, the maximum number of particles, and the resampled particles per time step. For the mean and variance, we chose a reasonable value: 10, 10, but the method is fairly agnostic to these. For minimum segment length, we used 1, and using a different value will be damaging. The max particles was 100, with 100 as resampleable. Any reasonably large choice will perform fine. For model variance (the penalty for modeling errors), we used $0.01$, which is agnostic to around $0.1$, where segments become agnostic to the input, and $0.001$, which exhibits oversegmenting (segments whenever the model does not perfectly predict, regardless of noise).
For the DPGMMs, we use 10 means (it chooses the number that it needs), zero initial mean and initial covariance of 1e10. This initial covariance is low because otherwise all the means end up being a single point. We use the implementation on sklearn: https://scikitlearn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html#sklearn.mixture. BayesianGaussianMixture, with default parameters there.
17 Learning Parameters for CMAES in Paddle Bouncing
For CMAES, we use a sample length of 100 frames, which increases after 12 epochs by 100 frames. We train for 30 epochs total. This value does not really change learning, but it optimizes sample efficiency. We use a population size of 10, and initial variance of 1 (initial mean is the initializations of the network, which uses a small uniform random). We use a gamma of $0.9$ and evaluate fitness based on return, though the choice of gamma does not strongly affect learning.
18 Details for the HyPE loop
The HyPE loop randomly chooses to run vision learning on every node in the existing graph. It then tests to see if the vision loss F1 score is above the threshold. Then, it performs the hypothesis test, testing if the numbers are above the threshold. Significantly greater is only relevant when there are more than 10 occurrences in the total categories. For example, the ball must have been bounced off the paddle at least 10 times. Then, the threshold is such that the ratio must be greater than $0.7$. The check for a hypothesis about $\mathrm{\Delta}{x}_{{o}_{j}}$ is always run afterwards, which involves taking the models after a salient changepoint, and clustering on the model displacements. Any cluster with greater than 20 occurrences is taken. For example, with the ball bouncing off the paddle, there are four clusters after the changepoint corresponding to the four angles at which the ball can bounce. Learning of multiple options involves switching between the different options when relevant (either after a changepoint with proximitybased saliency, or after a fixed number of timesteps with actionbased saliency).
19 Details for Code
A series of code commands to run the two iterations of the HyPE loop are detailed in the README.md file in the code, as well as the requirements. Code can be found at https://github.com/CalCharles/contingencyoptions
20 Extension of base assumptions
While our base assumptions might seem restrictive, we believe that these assumptions still generalize to many problems, and especially real world domains. In general, the world can be reduced into objects whose properties do not spontaneously change: this assumption is the basis for tool use. The quasistatic property is generally true, since it is a formalization of the term “inanimate object”, which is a common class of objects to manipulate. Proximity in time and space is another well used assumption, which while not guaranteed to be true, certainl y is often true. Finally, while formally proving a relationship is difficult, learning a policy to produce a particular change loosely mirrors experimentation in the scientific method, without the same precision of design.
21 Extension of the HyPE algorithm on Breakout
Notice that the HyPE agent, while maximizing a reward derived from causing a ball changepoint near the paddle, incidentally maximizes the true reward. However, this does not prevent us from continuing to apply the HyPE loop, and eventually even learning that block objects have temporal proximity to the true reward. By observing the interactions between the ball and the paddle, the HyPE agent can propose a hypothesis to define the different angles at which the ball can come off the paddle. In fact, running the HyPE loop on this data results in learning 4 different ball angles (described in supplementary material). Additionally, HyPE can learn a relationship between the ball and the blocks, and the blocks and true reward. Using the ball angles as options, then, one can optimize the removal of blocks directly. However, limitations with the vision algorithm and with learning options to hit the ball at different angles prevents us from demonstrating this functionality. We also posit that the relationships can be learned bidirectionally, backward from the node corresponding to true reward.
Hypothesis  $\mathrm{\Delta}{x}_{{o}_{j}}$  $\mathrm{\Delta}{y}_{{o}_{j}}$ 

${H}_{{d}_{0}}({x}_{{o}_{i}},{x}_{{o}_{j}})$  1.94  0.01 
${H}_{{d}_{1}}({x}_{{o}_{i}},{x}_{{o}_{j}})$  0.0  0.0 
${H}_{{d}_{2}}({x}_{{o}_{i}},{x}_{{o}_{j}})$  1.88  0.0 
Hypothesis  ${x}_{{o}_{i}}{x}_{{o}_{j}}$  ${y}_{{o}_{i}}{y}_{{o}_{j}}$ 

$H({x}_{{o}_{i}},{x}_{{o}_{j}})$  3.94  2.87 
22 Details on Hypotheses and Paddle policies
The actionpaddle hypotheses use the control hypotheses as defined in the paper, with a saliency function of proximal, and control behaviors learned from DPGMM on the displacement models in the segments after paddle changepoints. The values turn out to be: $1.88,1.94,0$, where the right and left command in true space are $2,2,0$. For the ball, we also perform denoising by DPGMM, by taking the mean location when a changepoint occurs. This turns out to be $3.94,2.87$, which is approximately above and to the left of the paddle. this is because the filters only look at changepoints, and are not necessarily centered. The different approximately mirrors the ball offset relative to the paddle offset (the ball in learning is more to the left).