Hypothesis-Driven Skill Discovery for Hierarchical Deep Reinforcement Learning

  • 2019-07-08 22:33:11
  • Caleb Chuck, Supawit Chockchowwat, Scott Niekum
  • 0


Deep reinforcement learning encompasses many versatile tools for designinglearning agents that can perform well on a variety of high-dimensional visualtasks, ranging from video games to robotic manipulation. However, these methodstypically suffer from poor sample efficiency, partially because they strive tobe largely problem-agnostic. In this work, we demonstrate the utility of adifferent approach that is extremely sample efficient. Specifically, we proposethe Hypothesis Proposal and Evaluation (HyPE) algorithm, which utilizes a smallset of intuitive assumptions about the behavior of objects in the physicalworld (or in games that mimic physics) to automatically define and learnhierarchical skills in a highly efficient manner. HyPE does this by discoveringobjects from raw pixel data, generating hypotheses about the controllability ofobserved changes in object state, and learning a hierarchy of skills that cantest these hypotheses and control increasingly complex interactions withobjects. We demonstrate that HyPE can dramatically improve sample efficiencywhen learning a high-quality pixels-to-actions policy; in the popular benchmarktask, Breakout, HyPE learns an order of magnitude faster than common baselinereinforcement learning and evolutionary strategies for policy learning.


Quick Read (beta)

Hypothesis-Driven Skill Discovery for
Hierarchical Deep Reinforcement Learning

Caleb Chuck
Department of Computer Science
The University of Texas at Austin
Austin, TX 78705
[email protected]
&Supawit Chockchowwat
Department of Computer Science
The University of Texas at Austin
Austin, TX 78705
[email protected]
&Scott Niekum
Department of Computer Science
The University of Texas at Austin
Austin, TX 78705
[email protected]

Deep reinforcement learning encompasses many versatile tools for designing learning agents that can perform well on a variety of high-dimensional visual tasks, ranging from video games to robotic manipulation. However, these methods typically suffer from poor sample efficiency, partially because they strive to be largely problem-agnostic. In this work, we demonstrate the utility of a different approach that is extremely sample efficient. Specifically, we propose the Hypothesis Proposal and Evaluation (HyPE) algorithm, which utilizes a small set of intuitive assumptions about the behavior of objects in the physical world (or in games that mimic physics) to automatically define and learn hierarchical skills in a highly efficient manner. HyPE does this by discovering objects from raw pixel data, generating hypotheses about the controllability of observed changes in object state, and learning a hierarchy of skills that can test these hypotheses and control increasingly complex interactions with objects. We demonstrate that HyPE can dramatically improve sample efficiency when learning a high-quality pixels-to-actions policy; in the popular benchmark task, Breakout, HyPE learns an order of magnitude faster than common baseline reinforcement learning and evolutionary strategies for policy learning.


capbtabboxtable[][\FBwidth] \captionsetupbelowskip=0pt


Hypothesis-Driven Skill Discovery for
Hierarchical Deep Reinforcement Learning

  Caleb Chuck Department of Computer Science The University of Texas at Austin Austin, TX 78705 [email protected] Supawit Chockchowwat Department of Computer Science The University of Texas at Austin Austin, TX 78705 [email protected] Scott Niekum Department of Computer Science The University of Texas at Austin Austin, TX 78705 [email protected]


noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

While recent advancements in deep reinforcement learning (RL) have been used to obtain exciting results on a variety of high-dimensional visual tasks, ranging from video games to robotic manipulation, these algorithms often require large amounts of data in order to achieve good performance. In many real-world tasks, such as robotics, this data is difficult and expensive to collect in large quantities. Though one of many factors, one cause of this high sample complexity is that deep RL algorithms typically strive to be as problem-agnostic as possible. Generality, while a laudable goal, often ignores powerful inductive biases—some of which may work well across large collections of problems of interest. In this work, we focus on object-centric tasks that (approximately) obey a set of common-sense physical laws; for example, we assume that objects have a consistent visual appearance and have properties (for instance, position or velocity) that do not change unless acted upon by another object. Many physical tasks admit an object based latent space with some consistent visual appearances, in which case factorizing the state using these assumptions can greatly reduce state space complexity. Furthermore, though there is a combinatorially large number of possible arbitrary hypotheses, many physical systems only have a small number of interesting circumstances under which object changes will occur (i.e., contant when push an object, or using a key to open a door). Thus, our assumptions direct exploration to target these states over any arbitrary unseen state. In exchange for this limiting set of assumptions, we will demonstrate that it is possible to realize dramatic gains in sample efficiency for problems in which these assumptions hold. Furthermore, we argue that this set of assumptions is quite reasonable for a wide range of tasks, ranging from pseudo-physical video games to robotic manipulation.

Figure 1: The HyPE (Hypothesis proposal and evaluation) loop. (1) Object Discovery: Visually discovers a new object that appears to undergo dynamical changes caused by an existing object (e.g. seeing the ball bounce off the paddle); (2) Hypothesis proposal: propose a hypothesis about causal control between a proposed object and the new object (e.g. the paddle causes a certain); (3) Verifies the hypothesis by learning a new skill that controls the new object (if possible) using the proposed object (e.g. learning to reliably bounce the ball at several different angles, using paddle-movement skills as primitive actions).Each iteration of the HyPE loop tries to learn a new object interaction, and repeated iterations builds up hierarchical control over different objects. Section 3 expands on each step in detail.

To leverage the problem structure provided by objects in physical reinforcement learning tasks, we introduce the Hypothesis Proposal and Evaluation (HyPE) algorithm. As an example of the type of learning process we wish to capture, consider a human learning to play the Atari game Breakout, in which the player must control a paddle to bounce a ball, which in turn, can destroy bricks at the top of the screen. By experimenting with random inputs, the player observes the ball and paddle moving, quickly identifying them as objects that obey pseudo-physical laws. However, only the paddle motion appears to be directly correlated with the controller inputs, so the player attempts to learn to control it first. As they do, they observe that the ball bounces off the paddle and destroys the bricks, which provides reward. Recognizing that they cannot affect the bricks through the paddle directly, the player learns to control the ball via bouncing with the paddle. Finally, once the player has learned to control the ball, they can learn strategies to aim and destroy bricks quickly, completing the game.

HyPE formalizes the intuitions behind such a learning process in a 3-step learning loop:

  1. 1.

    State abstraction via object discovery: The first step of the HyPE loop discovers objects from raw pixel inputs by learning convolutional filters that meet certain physics-guided criteria. Reasoning about objects provides a factorization of state that can circumvent the need to experience dense samples from the full distribution of possible states. Instead, HyPE learns to control simpler object-object interactions that each only rely on a small subset of state information. For example, changing the directional velocity of the ball with the paddle is only dependent upon the paddle and ball positions, and not the wall or bricks.

  2. 2.

    Skill proposal via hypothesis generation: The second step of the HyPE loop proposes hypotheses about what caused the changes it observed in object properties. Namely, it hypothesizes about whether the change is controllable and, if so, which object-object interactions can control it. HyPE then converts each hypothesis into a goal for a corresponding skill/option sutton , such that the control hypothesis becomes testable. For example, hypothesizing that the paddle can be used to change the directional velocity of the ball produces a goal of bouncing the ball.

  3. 3.

    Hierarchical skill learning via hypothesis evaluation: The third step of the HyPE loop uses RL to try to learn the options proposed in the previous step, thus testing the generated hypotheses. Successful learning of this option confirms the interaction hypothesis. However, for learning to be tractable, a given option often uses other previously learned options as primitive actions, leading to a hierarchy of skills. For example, an option for destroying a brick can use the ball-control options as primitive actions, which in turn use paddle-control options as primitive actions, which in turn use raw controller inputs.

We demonstrate that HyPE improves the sample efficiency of policy learning on the classic arcade game of Breakout from raw pixels by an order of magnitude, as compared to baseline deep RL algorithms. In addition, we show that HyPE automatically constructs a control structure which describes and characterizes several intuitive components of the game, providing evidence that HyPE contains the right set of inductive biases to serve as a foundation for scaling RL to real-world manipulation tasks.

2 Assumptions

We frame our problem as a Markov Decision Process (MDP). At each time step t, the agent observes state xX, with starting state distribution X0, and takes action aAraw. The next state is determined by T(x,a,x), which is the probability of experiencing subsequent state x given current state x and current action a. A policy π(x,a) is defined as P(a|x), or the probability of an action a given state x. The agent receives external reward as a function of the current state and action R(x,a). From this, we can define the return as the discounted future reward: rt=i=tTγi-tR(xi,ai), where γ is a given discount factor. Reinforcement learning problem tries to find the policy that maximizes this total expected return.

To solve this MDP efficiently, our proposed algorithm generates and tests hypotheses about relationships between actions and objects in the world based on several assumptions. These assumptions guide where interesting events will occur (locations), when to look for such events (salient times), and how to check for controllability (option policy learning).

In our instantiation of HyPE, we define f(oi) to be a function that finds the location of object oi in an input image, represented as xoi(t)—the x,y coordinates of that object (time indexed by t). Δxoi represents a change in position. In general, f(oi) and xoi(t) can encompass a broader set of object properties, rather than only position. The following subsections outline additional assumptions made by our instantiation of HyPE, but note that many of these assumptions can similarly be generalized beyond properties such as position.

2.1 Object Structure

Our first assumption is that the world is comprised of objects with consistent properties and relationships. We make this assumption not only to provide a basis for factorizing the state space, but also to factorize the action space. We use xraw to denote the input state, an image.

Consistent Visual Properties: We assume there exists a function foi(xraw)Axoi, which maps from the raw state xraw, to the location xoi of object oi. This implies that visual cues can be used to determine the location of an object in every state. A

Object Control Relationships: We say that there is a relationship between two objects oi,oj If one object oi can be used to control another object oj. This relationship is characterized by at least one control Δxoj, and corresponding policy with an action space Aoi defined over oi (manipulation of the first object), which produces Δxoj. This should define at least two behaviors: the natural behavior in oj when not controlled by oi, and the control policy behavior caused by oi. This is formalized by contrasting the control policy πAoi(Δxoj)(xraw,a) with a random policy πrandom(xraw,a), which has an ϵπ lower probability of producing the control behavior. In other words, the control policy has the property (with time horizon T):

P(xoj(t+T)-xoj(t)=Δxoj|πAoi(Δxoj)(xraw,a))-P(xoj(t+T)-xoj(t)=Δxoj|πrandom(xraw,a))ϵπ (1)

The set of possible Δxoj define ways to manipulate xoj within a set time horizon. If πAoi(Δxoj) is learned for at least one Δxoj, we can use these Δxoj an action space Axoj on object oj. In HyPE, oi and oj are only related if changes in oj can be induced by Axoi. Since xoj is a position, Δxoj is a displacement.

We can also treat the actions taken on the base MDP Araw and the extrinsically defined environment reward R(x,a) (the score in video games) as special types of objects. The state, xoi of these objects is the instantiation taken at time t (action taken and reward received respectively). These are “abstract objects”, since they are not observable in xraw.

2.2 Proximity

We will make use of a saliency S:{xoi(t)×xoj(t)}t0,T{0,1}T, which determines timesteps where one object is likely to effect change in another object. While in general, any two objects might interact at any time, we use spatial proximity and the quasi-static property (defined in the next section) to limit the search for object interactions. This narrowing of the search space of object interactions makes search more tractable, and limits the number of spurious relationships that could confuse a learner.

Spatial proximity uses xoi (the location of object oi) from Section 2.1. For objects oi and oj, we can define the proximity of these objects to be if xoi-xoj2ϵ, that is, the l2-norm is less than some epsilon.

Note that we implicitly assume temporal proximity in the defining saliency, since we say that certain timesteps are salient. This implies that object effects (changes) and salient events co-occur within a short time window.

2.3 Quasi-static Property

We additionally assume that some properties of objects are quasi-static—they do not change unless acted upon by another object. By making this assumption, we gain a clear indicator of when an interaction has occurred, and use that information to find an explanation for this change. This property is also physically supported: most objects do not spontaneously exhibit changes, especially in physical scenarios. Furthermore, even if such spontaneous changes occur (i.e. a chemical reaction), this property can be used to reason about the untracked variables that induced that change.

To define changepoints, we use the formulation from Changepoint Detection using Approximate Model Parameters (CHAMP) niekum . This formulation finds m changepoints in a trajectory, {c1,,cm}, where changepoints are determined such that within each segment {xoi(t)}t{ckck+1}, a model M has the property M(xoi(t))xoi(t+1), along with regularization of and constraints on the number of models. In other words, in segments between changepoints, a simple model can roughly predict the next state. Our model choice is: Md(xoi)=xoi+d, a fixed vector d-displacement model.

In this work, we use the quasi-static property in two ways. First, the quasi-static property assumes that changepoints are caused by an object-object interactions. Second, we use the quasi-static property to assert that object changepoints are salient times to search for changes in other objects (for example, flipping a switch might be a good time to check if anything in the environment changes).

Figure 2: The causal graph for the atari game breakout, specifying all of the object relationships. Grey objects and edges are those not necessary to achieve high extrinsic performance (though the blocks and reward are learnable). The walls are not learned by HyPE since they cannot be controlled, though as objects they can affect the ball.

2.4 Contingency

While there may be many objects in the world, we can limit the ones we are interested in by searching only for objects that are contingent, that is, objects we can control. By doing this, we avoid the problem of learning the full forward dynamics of a system, which might be very difficult, especially in a model free setting. At the same time, we still learn about all the objects the agent is able to effect in the environment.

We define contingent objects recursively: a contingent object can either be controlled directly by the actions available to the agent, Araw, or through control of a different contingent object. For example, the ball in Breakout is contingent because it can be controlled through the paddle (which is controlled directly by Araw), but the walls are not because they cannot be controlled by any contingent object.

To codify these object interactions, we use a graph 𝒢. A node Ni of 𝒢 corresponds to an object oi. A directed edge between a node only exists if a control policy as defined in Equation 1 has been learned. An object oj is contingent if there exists a path from the abstract node corresponding to the raw actions (call this 𝒜raw), and Nj. HyPE only attempts to learn edges from contingent nodes. The graph structure for Breakout is shown in Figure 2.

2.5 Option-based Causal Discovery

Finally, in HyPE we assume that if we can learn at least one πAoi(Δxoj)(xraw,a) from a contingent node oi, then object oi and object oj has a causal relationship where oi causes Δxoj (adding an edge between Ni and Nj in the graph defined in Section 2.4). This is because at least one πAoi(Δxoj)(xraw,a) implies there are at least two ways to control oj: Δxoj with a probability higher than ϵπ, using piAoi(Δxoj), and causing Δxoj with a probability higher than ϵ with a probability lower than ϵπ using πrandom. Recall that πAoi(Δxoj)(xraw,a) produces Δxoj with ϵpi higher probability than a baseline policy (typically the random policy). This means that through displacements of oi, this learned policy can do nontrivial control of oj, which define an action space over the target object Aoj.

3 Proposed Algorithm

Our system involves a three-step hypothesis proposal and evaluation loop (HyPE loop). We define a hypothesis as a Boolean statement about a relationship between two objects oi,oj. At each full iteration of the loop, the system adds edges to the graph 𝒢 as described in Section 4. The HyPE loop’s three steps are:

  1. 1.

    Object discovery, which locates a new object oj by learning a new foj(xraw).

  2. 2.

    Hypothesis proposal, which uses past data (possibly collected using prior iterations of HyPE) to define a hypothesis H about where and how two objects interact.

  3. 3.

    Hypothesis verification, where the agent learns to reliably reproduce the interaction between two objects.

Over multiple iterations of the HyPE loop, an object interaction graph 𝒢 as defined in section 2.4 is generated, one edge at a time. We start with a single node, 𝒜raw, and use it to discover filters to locate new objects in the frame, and learn how those objects can be controlled.

3.1 Object Discovery

This step of the algorithm seeks to visually identify objects that are controllable, either directly or via another object. This corresponds to searching for visual features whose movements can be explained by either direct action inputs or interactions with contingent objects. We make this into an optimization problem, in which we try to learn foj(xraw), a filter on the input image that identifies the location of an object that meets the aforementioned controllability criteria. We structure this section by describing the following:

  1. 1.

    Using the quasi-static property to find changes in a new object

  2. 2.

    Incorporating saliency with another object can be used to explain these changes.

  3. 3.

    Combine these into a loss term

  4. 4.

    Add smoothness to define the full loss function

  5. 5.

    Optimize the defined loss term

Visual Changepoints: The function foj:xraw2, represented by a convolutional neural network (CNN), outputs locations xoj(t) at every time-step (initially, these locations will probably be random, since the initial network is random, but given successful training, will track one of the objects). Taking this time series sequence of locations, we apply CHAMP to get a sequence of changepoints τ1,τm. We define C(xoj(t))):2{0,1} to be an indicator of whether there is a changepoint at time t in the sequence {xoj}t[1,,T]. From the quasi-static property, we expect that some kind of object interactions produced these changepoints.

Explaning Changepoints: Given the changepoints of oj, we search for other objects which might have caused some of these changepoints. We narrow our search by restricting possible object interactions to salient regions. If oi to interacts with oj at time t we expect to see changepoints C(xoj(t)))=1 at a salient point. Define S(xoi(t),xoj(t)):2×2{0,1} to be an indicator if a point is salient. If we want all changepoints to be explained by interaction with object oj, this entails C(xoj(t)))S(xoi(t),xoj(t)) for all t.

Changepoint Loss: However, the above comparison assumes that all changepoints are explained by salient points, and all salient points are accompanied by changepoints. This is not always (or often) the case. For example, the ball might bounce off of objects other than the paddle (a changepoint without being salient), and the ball might come close to the paddle without interacting with it (a salient point that is not a changepoint). To account for this, we search for statistically significant amounts of interaction from our historical data (data collected by prior iterations of the HyPE loop). We take the number of changepoints that occur when salient N(S(xoi(t),xoj(t))C(xoj(t))), and compare with the number of changepoints N(C(xoj(t))) and the number of salient points N(S(xoi(t),xoj(t))).

To search for statistical significance, we use the F1 score F1(𝐱oi,𝐱oj) (defined in the appendix) between these three terms. The F1 score measures the statistical significance of events being correlated. Since the F1 score is computed over counts on the whole dataset (length n), we use 𝐱o to denote the full dataset of object positions. Notice that we want the number of explained changepoints (N(S(xoi(t),xoj(t))C(xoj(t)))) to be significant relative to the total number of changepiont N(C(xoj(t))), so that the learned filter does not simply make every frame a changepoint to optimize the explained changepoints. Similarly, we do not want to learn another feature which tracks oj, so we also want the number of explained changepoints to be significant relative to the number of salient points N(S(xoi(t),xoj(t))). The F1 score gives an an optimizable value which balances these objectives.

Loss function: We add an additional regularization to prevent long jumps, which are not physical. foj(xraw(t+1))-foj(xraw(t)). This regularizes displacement of xoj (controlled by λ1). This gives our full objective for object discovery:

L(𝐱oi;𝐱oj)=F1(𝐱oi;foj(𝐱raw))-λ1t=0n-1foj(xraw(t+1))-foj(xraw(t)) (2)

Optimization: We optimize this objective with covariance matrix adaptation evolution strategy (CMA-ES,  hansen ). In addition, we smooth the outputs to clean up relatively rare cases where the filter fails, an operation has been shown to improve performance on downstream deep learning policy tasks chuck . We represent fθoj with a two layer convolutional network with 3×3 and 5×5 filters.

Figure 3: Images from the object discovery step of HyPE, where the cross-hairs show the location of the object. The bottom row shows the heatmap of the filter. On the left, is tracking of the paddle location, and the right shows tracking of the ball

3.2 Hypothesis Proposal

We generate hypotheses about whether a causal relationship exists between two objects. In section 2.1, we describe object relations as being control of one object through actions defined over another. Thus, to propose a hypothesis, we need to first check if the proposed hypothesis has reasonable evidence from historical data, and then generate a function which indicates when a control produced by another object occurs. In this section, we will describe a means of generating two different kinds of control indicators: a changepoint control, and a displacement control.

Changepoint Control: Before generating a changepoint indicator, we need to determine whether a (controlled) changppoint occurs with statistical significance. This mirrors the F1 criteria defined in the last section, since the objective is closely related: we want to see if our premise affects our target often in data. However, since we are not trying to optimize the inputs, the HyPE hypothesis proposal step first tests:

χ(C(𝐱oj),S(𝐱oi,𝐱oj))=N(S(xoi(t),xoj(t))C(xoj(t)))ηN(S(xoi(t),xoj(t))¬C(xoj(t))) (3)

This indicates if a changepoint occurs at a salient time (S(xoi(t),xoj(t))C(xoj(t))) η times more often than not (S(xoi(t),xoj(t))¬C(xoj(t))). If this test fails, i.e., in Breakout when checking whether ball changepoints co-occur with actions, then it does not make sense to propose a hypothesis. That does not mean that a hypothesis does not exist—there might be insufficient data, but it does suggest searching for other, more likely object interactions. On the other hand, if the test passes this suggests that one object can be controlled by interaction with another, i.e., the ball can be controlled by bouncing with the paddle. Changepoint indicators take the form:

H(xoi(t),xoj(t))=S(xoi(t),xoj(t))C(xoj(t)) (4)

Which is a single state version of the Equation  3: observing if a changepoint C(xoj(t)) occurs at a salient time S(xoi(t),xoj(t)). In the case of the ball and paddle, it would return 1 (true) when the ball bounces off the paddle, and 0 (false) in other circumstances.

Displacement Control: However, we might want to do more than just induce an arbitrary changepoint. If supported by data, learning options which induce a particular control, in the form of a displacement, will give more options and varied control. For example, rather than simply training the paddle to change any direction, training control would give options that correspond to right, left or stationary movements.

Recall Δxoj=d is a displacement of the target object, and Md() is a fixed d-displacement model (see Section 2.3) for a segment which follows some changepoint ck (a timestep). Then, the control hypothesis checks if the segment after a salient changepoint S(xoi(t),xoj(t))C(xoj(t)) (as defined in the last section), is fit to a particular, desired d-displacement model:

Hd(xoi(t,t+1),xoj(t,t+1))=xoj(t+1)-xoj(t)-dϵS(xoi(t),xoj(t)) (5)

In the ball example, this would mean that after the ball bounces off the paddle, a control indicator will indicate true only if the angle of the ball bounce is equal to the desired ball bounce.

In order to decide which displacements to propose hypotheses for (given the check defined in Equation 3 has already passed), the HyPE algorithm takes the segment models Md following salient changepoints, and clusters them according to the d parameter using Dirichlet Process Gaussian Mixture Models (clustering is used to denoise). If the cluster corresponding to dk has a minimum number of assignments, that control hypothesis is proposed.

3.3 Hypothesis Evaluation

Now, we want to test whether the hypothesis about causal control between two objects is true. This is done by simply incorporating the boolean control indicators as reward functions for different options, output 1 if the hypothesis is true and 0 if false:

R(s(t))={1H(xoi(t,t+1),xoj(t,t+1))0otherwise (6)

We define our actions as the learned options, Aoi, over the premise object oi (not the raw actions Araw). This allows us to gain sample efficiency through action abstraction and hierarchy.

To gain sample efficiency through state abstraction, we use as input the locations of the premise and target objects. Though this assumption can be limiting (if there is some other, unknown object that affects the target, the policy does not account for this object), we use this abstraction because the performance benefits appear to outweigh this cost. Even with other objects that are unaccounted for, we still expect to induce a change more often than random, allowing future iterations of HyPE to learn about the unknown object(s).

In order to maximize the advantage of our state abstraction, we add several simple transformations of the inputs to our state space, namely velocity and relative position of the objects: xoj(t+1)-xoj(t) and xoi(t)-xoj(t). Our neural net architecture computes the following:

softmax(Wfeedforward1ki=0kσ(Wvectorϕk(xoi,xoj))) (7)

Where ϕ(,) are our input operations. This architecture converts each input state into a length 128 vector, where Wvector is a 128×2 matrix of weights (all input properties have dimension 2), and σ is some activation. It then takes the mean of all feature vectors, and feeds these forward to the outputs.

Figure 4: Performance comparison between PPO, A2C (orange), Rainbow (blue) and HyPE (maroon), evolutionary strategies salimans , CMA-ES and a baseline (10 trials each) of learning on the true paddle and ball locations (Base). The y-axis is average episode return, and the x-axis is number of frames experienced, on a log scale. HyPE outperforms Rainbow, the best performing baseline, by 7x in training and 18x in test. This difference is because HyPE uses CMA-ES as an optimizer, which tends to have significantly lower training performance than test, which is not as true for Q-learning methods like Rainbow. HyPE testing performance at 55k frames matches Rainbow training performance at 1m time steps (see Table 1).
Table 1: Table of training time to find evaluation policy with 244 blocks hit, the average test score of HyPE after 55,500 frames of training (standard error 27, 20 trials). “Base” is a CMA-ES algorithm run on the relative positions of the paddle and the ball, ball velocity, and ball and paddle positions, from the true underlying game state.
Algorithm Base HyPE Rainbow A2C & PPO
Timesteps 52,000 55,500 1,000,000 >1,500,000

4 Results

We demonstrate that HyPE can learn to achieve high performance on the classic game Breakout after two iterations of the algorithm loop. Both the visual and behavior components have intuitive interpretations, so we can observe the performance of the HyPE system as it progresses. As a step by step progression: since the HyPE loop has no initial information, it learns from 1k frames of data with a random policy, and then uses that information with the loss defined in Equation 2, and a prior object of oi=𝒜raw, which is the only node in the graph so far. Since the vision loss is above the threshold F1 score, ϵ, it checks and proposes a hypothesis, which results in three characteristic behaviors as defined in Equation 5 after clustering11 1 The HyPE loop automatically chooses a control hypothesis when it has control clusters with more than 10 assignments. Having finished step 2, the HyPE loop performs the learning of three hypothesis-generated options—moving the paddle ±2 or 0 pixels, respectively. These options use paddle position and velocity as input, and learning converges in 2.5k timesteps. Though this first step of the HyPE loop learns an intuitive first object (the paddle), this is not encoded explicitly anywhere in the algorithm, but emerges from physical priors and controllability. Because learning succeeds, the HyPE loop adds a new node, connected to 𝒜raw, which we call Npaddle.

Using the cumulative data, the HyPE loop then applies the vision loss to all nodes. 𝒜raw, with the paddle mean removed from the image, and does not meet the threshold to propose any new objects. However, Npaddle discovers a new object—the ball (the results of the vision step are shown in Figure 3). In this case, the ball discovery is a consequence of the changepoints it exhibits near the paddle. The subsequent hypothesis check for changepoints near the paddle passes and is used to generate a reward function for learning an option to bounce the ball off of the paddle. Using this learned option continuously is sufficient to achieve high extrinsic reward in a small number of frames.

In Figure 4, We show that HyPE has roughly an order of magnitude improvement in sample efficiency compared to baseline RL methods. HyPE, at train time, achieves average reward per episode of 17.5 in 55k frames, while Rainbow hessel takes 400k timesteps, and Proximal Policy Optimization schulman and A2C mnih take roughly 1.4M timesteps to achieve the same performance. However, CMA-ES, used to learn the HyPE bouncing policy, typically has higher test performance than train. Thus, we also show that the evaluation policy learned by HyPE after 55000 frames achieves 244 average reward per episode performance, which is more than an order of magnitude better than Rainbow, the best performing baseline.

Finally, in order to better understand the performance gains of HyPE, we demonstrate that a learner using the actual positions of the ball and the paddle (and relative positions) achieves performance similar to that learned by HyPE, as shown by the “Base” in Figure 4 and Table 1. This implies that the majority of the performance improvement comes from using the object relative input states. However, HyPE provides a principled method for learning and basis for using those object-relative input states, as well as the ability to perform targeted, hierarchical exploration.

5 Related Work

This work amalgamates ideas from several different sub-directions of reinforcement learning. In particular, it incorporates concepts from causal graphs, relational and object-oriented reinforcement learning, object-scene modeling literature, model based reinforcement learning, exploration methods in reinforcement learning and option-based or hierarchical reinforcement learning.

Causal graphs: The control graph structure and hypothesis verification components of HyPE draw upon ideas related to causality pearl . These components of HyPE relate to where graphs Shanmugam and graph-dependent policies Buesing ; wangnerve are learned from interactions with the environment. Graph network architectures provide a differentiable structure by which to learn causal object relationships battaglia ; garnelo , and have been used to pick up object relationships toyer for forward modeling through object oriented and relational reinforcement learning diuk ; zambaldi . Object-oriented and relational reinforcement learning often use pre-conditions and post-conditions to determine object relationships, and have been used for forward modeling watters ,curriculum generation silva and transfer marom . Alternatively, HyPE uses a similar object-oriented network structure for highly efficient hierarchical reinforcement learning and visual factorization.

Schema networks Kansky , combine intuitive physics with a process for learning causal networks for gaining sample efficiency on model transfer. These networks achieve results related to object interactions and useful control over them which parallels the learned object-interaction graph from the HyPE loop. However, Schema Networks do not learn objects from raw inputs or provide a curriculum of learning over different objects, preventing them from having the same sample-efficiency benefits on non-transfer problems as HyPE. To our knowledge, HyPE represents a unique method for sequentially generating a causal graph structure from the ground up.

Scene Modeling: Decomposing a scene into elements, and modeling the forward dynamics of those elements is a classic problem. HyPE approaches this problem in an unsupervised manner. Several works have shown promising results in using neural networks in the form of variational auto encoders higgins , or expectation maximization van2018relational to decompose scenes to latent spaces, sometimes factorized burgess . In addition to simply decoming the scene, some methods strive to learn a forward model, in the raw pixel space chiappa2017recurrent , the latent space hafner2018learning , using pairwise interactions chang2016compositional , graph networks battaglia2016interaction or Bayesian inference eslami2016attend ; kosiorek2018sequential . Some methods also incorporate physical properties, such as energy conservation sienko2002hamiltonian , force mottaghi2016happens or intuitive physics agrawal2016learning . Broad physical assumptions have been incorporated into work in probing intuitive dynamics Piloto , learning physical dynamics or relations Chang ; zambaldi , object representations Greff , and physics belbute .

However, while these methods generally perform complete scene modeling, HyPE represents a guided means of targeting modeling towards specific contingent regions, and learning modeling in a factored and sequential manner, both of which reduce the problem complexity drastically. In addition, HyPE avoids more scene complexity by learning policies to produce specific control without requiring complex modeling of pixel forward dynamics.

Exploration: This work incorporates ideas from exploration in reinforcement learning literature by designing a system for generating intrinsic rewards which guide an agent towards interesting. Early exploration work involves trying to uniformly explore actions by epsilon greedy policies sutton2018reinforcement , state liu2018learning , or policy space eysenbach2018diversity , sometimes incorporating entropy in the optimization mnih . Similar to state uniformity, other exploration work exploits novelty by learning hash functions to search for novel states ostrovski ; Tang ; burda , or replicating actions to return to partially explored regions savinov ; ecoffet . Alternatively “curiosity” or “surprise” is used, by training an internal model to predict state, and rewarding reching states which high predictive error burda2018large ; pathak2017curiosity. The HyPE algorithm differs from these exploration schemes in that it is a directed exploration that uses physical priors to target object interactions, rather than novelty or curiosity. Our method does guide towards changepoints, which bears some similarity to curiosity based methods.

Other methods might also direct exploration towards certain values, for example, by trying to explore towards sub-goals from hindsight andrychowicz , using a learned controller to direct exploration vezhnevets , bottleneck regions bacon or contingency bellemare ; choi2019contingency . The HyPE algorithm is most similar to the last group, in that we direct exploration towards a particular kind of interesting state: controllable changepoints.

Hierarchical RL: Learning hierarchies of control with options has been studied in detail sutton ; bacon , and can be used to define a system for learning skills and state spaces konidaris ; kulkarni2016hierarchical ; barreto2017successor ; mankowitz . Other methods use a hierarchy involving a high level controller which gives target states, and a low-level controller that navigates to them vezhnevets2017feudal ; arulkumaran2016classifying . While the HyPE loop also builds a hierarchy of interations, it uniquely exploits the combination of broad physical assumptions and hierarchical reinforcement learning to achieve state and action abstraction.

Model-based RL: Model based reinforcement learning using a learned model has recent shown substantial improvements in reinforcement learning sample efficiency in Atari games kaiser . These methods often learn to predict future raw state silver2017predictron , or latent space watter2015embed in tandem to learning a useful policy schmidhuber2015learning . They then incorporate planning, possibly in the latent space piergiovanni2018learning to reach their goal lowrey2018plan . The agent can also be traied to trade off reliance on the environment model racaniere2017imagination ; nagabandi2018neural , or use multiple models rajeswaran2016epopt for robustness. HyPE makes loose use of modeling to generate different options, but once it learns control policies, it applies model free reinforcement learning. As a result, HyPE should be able to take advantage of model based RL to improve sample effiency and accuracy further.

6 Conclusion

We introduced the HyPE algorithm, which incorporates general purpose priors about the world, such as proximity, object factorization, and quasi-static assumptions, in order to efficiently learn to hierarchically explore and control its environment. Though this system requires several limiting assumptions, making it less application agnostic than classic general-purpose RL algorithms, these assumptions are reasonable for physical domains and lead to sample efficiency that is roughly an order of magnitude better than baseline RL methods. Future work can aim to address the practical issues required to extend HyPE to successfully work in physical real-world domains, such as robotic manipulation. Furthermore, the causal graph structure generated by HyPE may have implications for both explainable AI as well as transfer learning that can be explored.


The authors are supported by the ONR through the National Defense Science And Engineering Graduate Fellowship (NDSEG) program. This work has taken place in the Personal Autonomous Robotics Lab (PeARL) at The University of Texas at Austin. PeARL research is supported in part by the NSF (IIS1724157, IIS-1638107, IIS-1617639, IIS-1749204) and ONR(N00014-18-2243).


  • (1) Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016.
  • (2) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
  • (3) Kai Arulkumaran, Nat Dilokthanakul, Murray Shanahan, and Anil Anthony Bharath. Classifying options for deep reinforcement learning. arXiv preprint arXiv:1604.08153, 2016.
  • (4) Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • (5) André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065, 2017.
  • (6) Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016.
  • (7) Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • (8) Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using atari 2600 games. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
  • (9) Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint arXiv:1811.06272, 2018.
  • (10) Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
  • (11) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • (12) Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.
  • (13) Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.
  • (14) Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.
  • (15) Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
  • (16) Jongwook Choi, Yijie Guo, Marcin Lukasz Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingency-aware exploration in reinforcement learning. 2019.
  • (17) Caleb Chuck, Michael Laskey, Sanjay Krishnan, Ruta Joshi, Roy Fox, and Ken Goldberg. Statistical data cleaning for deep learning of automation tasks from demonstrations. In 2017 13th IEEE Conference on Automation Science and Engineering (CASE), pages 1142–1149. IEEE, 2017.
  • (18) Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. End-to-end differentiable physics for learning and control. In Advances in Neural Information Processing Systems, pages 7178–7189, 2018.
  • (19) Carlos Diuk, Andre Cohen, and Michael L Littman. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 240–247. ACM, 2008.
  • (20) Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • (21) SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pages 3225–3233, 2016.
  • (22) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • (23) Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.
  • (24) Klaus Greff, Raphaël Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450, 2019.
  • (25) Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
  • (26) Nikolaus Hansen and Andreas Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE international conference on evolutionary computation, pages 312–317. IEEE, 1996.
  • (27) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • (28) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
  • (29) Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
  • (30) Ken Kansky, Tom Silver, David A Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1809–1818. JMLR. org, 2017.
  • (31) George Konidaris. Constructing abstraction hierarchies using a skill-symbol loop. In IJCAI: proceedings of the conference, volume 2016, page 1648. NIH Public Access, 2016.
  • (32) Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In Advances in Neural Information Processing Systems, pages 8606–8616, 2018.
  • (33) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
  • (34) Evan Zheran Liu, Ramtin Keramati, Sudarshan Seshadri, Kelvin Guu, Panupong Pasupat, Emma Brunskill, and Percy Liang. Learning abstract models for long-horizon exploration. 2018.
  • (35) Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848, 2018.
  • (36) Timothy A. Mann Mankowitz, Daniel J. and Shie Mannor. Adaptive skills adaptive partitions (asap). In Advances in neural information processing systems, 2016.
  • (37) Ofir Marom and Benjamin Rosman. Zero-shot transfer with deictic object-oriented representation in reinforcement learning. In Advances in Neural Information Processing Systems, pages 2291–2299, 2018.
  • (38) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • (39) Roozbeh Mottaghi, Mohammad Rastegari, Abhinav Gupta, and Ali Farhadi. “what happens if…” learning to predict the effect of forces in images. In European Conference on Computer Vision, pages 269–285. Springer, 2016.
  • (40) Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
  • (41) Scott Niekum, Sarah Osentoski, Christopher G Atkeson, and Andrew G Barto. Online bayesian changepoint detection for articulated motion models. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1468–1475. IEEE, 2015.
  • (42) Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2721–2730. JMLR. org, 2017.
  • (43) Judea Pearl. Causality. Cambridge university press, 2009.
  • (44) AJ Piergiovanni, Alan Wu, and Michael S Ryoo. Learning real-world robot policies by dreaming. arXiv preprint arXiv:1805.07813, 2018.
  • (45) Luis Piloto, Ari Weinstein, Arun Ahuja, Mehdi Mirza, Greg Wayne, David Amos, Chia-chun Hung, and Matt Botvinick. Probing physics knowledge using tools from developmental psychology. arXiv preprint arXiv:1804.01128, 2018.
  • (46) Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pages 5690–5701, 2017.
  • (47) Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
  • (48) Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • (49) Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.
  • (50) Jürgen Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.
  • (51) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • (52) Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.
  • (53) Wieslaw Sienko, Wieslaw M Citko, and Bogdan M Wilamowski. Hamiltonian neural nets as a universal signal processor. In IEEE 2002 28th Annual Conference of the Industrial Electronics Society. IECON 02, volume 4, pages 3201–3204. IEEE, 2002.
  • (54) Felipe Leno Da Silva and Anna Helena Reali Costa. Object-oriented curriculum generation for reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1026–1034. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • (55) David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3191–3199. JMLR. org, 2017.
  • (56) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • (57) Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • (58) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
  • (59) Sam Toyer, Felipe Trevizan, Sylvie Thiébaux, and Lexing Xie. Action schema networks: Generalised policies with deep learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • (60) Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.
  • (61) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3540–3549. JMLR. org, 2017.
  • (62) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3540–3549. JMLR. org, 2017.
  • (63) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. Nervenet: Learning structured policy with graph neural networks. 2018.
  • (64) Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.
  • (65) Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In Advances in neural information processing systems, pages 4539–4547, 2017.
  • (66) Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.

7 Breakout

Figure 5: The Breakout domain. The paddle has been colored green, the ball has been colored red, the walls have been colored orange, and the blocks have been colored purple. The ball moves at a constant velocity, unless it comes into contact with the blocks, wall or paddle. The ball has an instantaneous change in velocity on contact. The paddle can move left and right or stay stationary at each timestep. When a block is hit by the ball, it disappears and extrinsic reward is given. If the ball hits the bottom wall 5 times, the episode ends. However, if all blocks are hit the loss counter resets, which inflates scores past 100 points. When used to train HyPE, the full image is grayscale and binarized (all values are 0 or 1)

8 Atari 2600 Breakout

While our version of Breakout has all the same intuition as the Atari 2600 version, in this work we do not use the traditional Atari 2600 version of the game. This is because it encroaches on 3 of our guiding assumptions. In particular, consistent properties, consistent relationships and the quasi-static property. In the case of visual structure, we assume that the visual scene is the same as the world state. However, in Atari 2600 objects can become occluded and disappear from the scene. Second, we assume that control is constant in terms of displacement, but this is not true in two ways. First, paddle control has a momentum value such that the first action or change of action produces partial changes to the paddle shape and position. Second, the ball changes velocity and appearance depending on the number of bounces that has been made overall in the episode. This non-stationarity also affects the quasi-static property, where the properties of the ball change not as a result of interaction with an object, but because of an internal counter. While we do not expect that these problems are insurmountable, solving them requires significant extensions to the existing system.


[H] HyPE Loop \[email protected]@algorithmic \STATEInput: Raw actions Araw, Environment E \STATEInitialize: Dataset 𝒟=[], Graph 𝒢={} \STATEAdd 𝒜raw to 𝒢 \STATECollect data from random policy: 𝒟𝒟πRandom Repeat \REPEAT\STATEChoose node Ni from 𝒢 randomly \STATEGet 𝐱oi={fθoj(xraw),xraw𝒟} \STATEobject discovery Optimize argmaxθL(xoi,fθoj(xraw)),xoi,xraw𝐱oi,𝒟 (Equation 2) \IFL(xoi,fθoj(xraw))>τvision \STATEHypothesis Proposal \STATEGet 𝐂oj=C(fθoj(xraw(t))),xraw(t)𝒟 \STATEGet 𝐒oi,oj=S(xoi(t),fθoj(xraw(t))),xraw(t)𝒟,xoi(t)𝐱oi \IFχ(𝐂oj,𝐒oi,oj) (Equation 3 \STATEDefine H(xoi(t,t+1),xoj(t,t+1)) (Section 2.2) \STATEDefine all RΔxojdk(xoi,oj(t,t+1))22 2 if multiple control hypotheses and input state [xoi-xoj,x˙oj,xoi,xoj] 33 3 where relevant, that is, only include terms where oi/oj not abstract \STATEPerform RL to maximize all RΔxojdk \ENDIF\ENDIF\UNTILSufficient performance

9 Object Discovery F1 Score

The F1 score is a common test for statistical significance, defined as 2precisionrecallprecision+recall. Given event A, and its complement A~, a classifier C that assigns to A, and a counts N(A),N(A~),N(CA), which are the number of true occurrances of A, B, and classification to A, then the precision is N(CA)N(A), and recall is N(CA)N(A)+N(B). In our case, event A is S(xoi(t),fθoj(xraw(t)))C(fθoj(xraw(t))), which is interpreted as a timestep where xoi,fθoj(xraw) are salient44 4 In this work, we use proximity for all object-object interactions (such as the Paddle and Ball), and the quasi-static assumption for abstract object-object interactions (such as the Actions and Paddle), and there is a changepoint in fθoj(xraw). Given counting function N, and operating on a full dataset of input

precision=N(S(xoi(t),xoj(t))C(xoj(t))N(C(xoj(t)))+N(S(xoi(t),xoj(t))) (8)


recall=N(S(xoi(t),xoj(t))C(xoj(t))N(C(xoj(t))) (9)

This precision and recall defines the desired F1 metric F1(xoi,fθoj(𝐱raw)).

10 Object Discovery Smoothing

The optimization yields a candidate object hypothesis h(*) which fails on complicated frames due to the limited number of parameters. We correct the model by training a larger CNN to match the target heatmaps z(h(*)(x(t))) generated by inserting a low-variance normal curve centered at the coordinate h(*)(x(t)) of each frame. The last filter is penalized with a lasso regularization resulting in Equation 10. Lastly, the object recognition is composed by h(x(t))=argmaxyAloc[f(x)]y.

L2(f;h(*))=||f(x(t))-z(h(*)(x(t)))||2+λ2||f(x(t))||1 (10)

11 Object Discovery Hyperparameters

The initial population weights are sampled from a normal distribution N(0,1). The hyperparameters are set as followed: the proximity threshold ϵ=6, regularization coefficient λ1=20, a population size of 50, and run for 40 epochs. An optimal filter can typically be picked up in this number of epochs, and the algorithm is not particularly sensitive to λ1 or population size (though a bad choice of λ1 will result in a stationary policy. The threshold for having passable F1 score for vision is 0.5. Values below this threshold can generally be generated by randomly jittering the frame to match some of the desired properties.

12 Saliency Functions

We use as saliency functions: 1) a proximity indicator. That is, the salience function returns true or false for a time step to define a salient time, based on the evaluation:

Sprox(xoi(t),xoj(t))={Truexoi(t)-xoj(t)<ϵFalseotherwise (11)

when oi,oj are objects with locations. With abstract objects (such as the raw actions, that are always proximal, or just alternatively), we use a a quasi-static based saliency function that checks if there is a changepoint in the trajectory of oi. That is, given a set of changepoints c1,cn, the salience is:

SC(xoi(t),xoj(t))={1t{c1,cn}0otherwise (12)

In terms of which salience to choose, we can simply apply the hypothesis testing or object detection with all salience functions until one sticks (though abstract objects don’t use proximity except as dummy salience). Finally, we can extend either salience function to include a window around salience, which can account for noise and some delayed reaction (i.e. calling the union of sets {ci-w,,ci+w} salient). For control hypotheses (hypotheses involving Δxoj, since we assume that object in contact are interacting if proximity has already been shown to produce changepoint interactions, we use proximity for salience to give a dense reward whenever the object has the desired motion, given proximal. This is useful for abstract objects, where we expect the abstract object to control or be controlled directly by some property of the premise object.

In practice, we use a distance of approximately 6 pixels to define proximity, and do not use a window around changepoints.

13 Network Ablations

Figure 6: Network architecture, where different input operations on the locations of oi and oj are mapped to a fixed length vectors, and the mean of the output is fed forward to get the probability distribution over actions

We tried several other network architectures. In particular, we used a fully connected network, which struggled to interpret the xoi,xoj inputs, if included, but otherwise performed comparably (though with different, often simpler, output policies). We also tried a transformer style network, which computed keys, queries and values for each of the inputs. However, the size of our network was limited by using CMA-ES, and performance on comparably sized networks was strictly worse. This network was simplification, where the mean operation allows the network to balance the inputs without fixating on any one of them. We tried different sizes of maps (other than 128), which the network is fairly agnostic to. Even a size 32 map can learn useful behavior that achieved decent reward (>100). The maximum size is around 512, limited by the number of parameters in the network that was viable for CMA-ES optimization.

14 Argument for CMA-ES

We tried a variety of policy gradient and deep Q-learning methods to train on some function of the input parameters, without any success, using the true object locations to remove possible noise from the object detection system. This suggests that learning policies on object locations in a reasonable number of frames seems to be a difficult for generic deep RL algorithms, at least for choosing a good set of hyperparameters. We attempted A2C, PPO, Rainbow, DQN, SARSA, SARSA with functional basis, and tabular Q learning. CMA-ES perform well on this reduced input space, because it requires matrix inversion, an n3 operations. This capped the parameter number of the network to 11k parameters, which was probably not sufficient to perform well if learning from raw images.

15 Baseline architecture

Our base architecture for the baseline networks (A2C, PPO) consisted of a 8x8, 32 map filter with stride 4, followed by a 5x5, 64 map filter with stride 2, and a 3x3, 64 map filter of stride 1, followed by a linear layer from the output of the last convolutional layer to size 512. This layer is fully connected to actor and critic components. This architecture has been used in many Atari Deep RL environments, and is the default Atari network for the Google dopamine framework: https://github.com/google/dopamine

16 Learning Parameters for CHAMP and DP-GMM

The system is fairly agnostic to the parameters for CHAMP and DP-GMM, though a bad choice can produce bad behavior. CHAMP requires 5 parameters, and the displacement model requires an additional parameter. The champ parameters are: a guess at the mean length of a segment, the variance, the minimum segment length, the maximum number of particles, and the resampled particles per time step. For the mean and variance, we chose a reasonable value: 10, 10, but the method is fairly agnostic to these. For minimum segment length, we used 1, and using a different value will be damaging. The max particles was 100, with 100 as resample-able. Any reasonably large choice will perform fine. For model variance (the penalty for modeling errors), we used 0.01, which is agnostic to around 0.1, where segments become agnostic to the input, and 0.001, which exhibits over-segmenting (segments whenever the model does not perfectly predict, regardless of noise).

For the DP-GMMs, we use 10 means (it chooses the number that it needs), zero initial mean and initial covariance of 1e-10. This initial covariance is low because otherwise all the means end up being a single point. We use the implementation on sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html#sklearn.mixture. BayesianGaussianMixture, with default parameters there.

17 Learning Parameters for CMA-ES in Paddle Bouncing

For CMA-ES, we use a sample length of 100 frames, which increases after 12 epochs by 100 frames. We train for 30 epochs total. This value does not really change learning, but it optimizes sample efficiency. We use a population size of 10, and initial variance of 1 (initial mean is the initializations of the network, which uses a small uniform random). We use a gamma of 0.9 and evaluate fitness based on return, though the choice of gamma does not strongly affect learning.

18 Details for the HyPE loop

The HyPE loop randomly chooses to run vision learning on every node in the existing graph. It then tests to see if the vision loss F1 score is above the threshold. Then, it performs the hypothesis test, testing if the numbers are above the threshold. Significantly greater is only relevant when there are more than 10 occurrences in the total categories. For example, the ball must have been bounced off the paddle at least 10 times. Then, the threshold is such that the ratio must be greater than 0.7. The check for a hypothesis about Δxoj is always run afterwards, which involves taking the models after a salient changepoint, and clustering on the model displacements. Any cluster with greater than 20 occurrences is taken. For example, with the ball bouncing off the paddle, there are four clusters after the changepoint corresponding to the four angles at which the ball can bounce. Learning of multiple options involves switching between the different options when relevant (either after a changepoint with proximity-based saliency, or after a fixed number of timesteps with action-based saliency).

19 Details for Code

A series of code commands to run the two iterations of the HyPE loop are detailed in the README.md file in the code, as well as the requirements. Code can be found at https://github.com/CalCharles/contingency-options

20 Extension of base assumptions

While our base assumptions might seem restrictive, we believe that these assumptions still generalize to many problems, and especially real world domains. In general, the world can be reduced into objects whose properties do not spontaneously change: this assumption is the basis for tool use. The quasi-static property is generally true, since it is a formalization of the term “inanimate object”, which is a common class of objects to manipulate. Proximity in time and space is another well used assumption, which while not guaranteed to be true, certainl y is often true. Finally, while formally proving a relationship is difficult, learning a policy to produce a particular change loosely mirrors experimentation in the scientific method, without the same precision of design.

21 Extension of the HyPE algorithm on Breakout

Notice that the HyPE agent, while maximizing a reward derived from causing a ball changepoint near the paddle, incidentally maximizes the true reward. However, this does not prevent us from continuing to apply the HyPE loop, and eventually even learning that block objects have temporal proximity to the true reward. By observing the interactions between the ball and the paddle, the HyPE agent can propose a hypothesis to define the different angles at which the ball can come off the paddle. In fact, running the HyPE loop on this data results in learning 4 different ball angles (described in supplementary material). Additionally, HyPE can learn a relationship between the ball and the blocks, and the blocks and true reward. Using the ball angles as options, then, one can optimize the removal of blocks directly. However, limitations with the vision algorithm and with learning options to hit the ball at different angles prevents us from demonstrating this functionality. We also posit that the relationships can be learned bi-directionally, backward from the node corresponding to true reward.

Hypothesis Δxoj Δyoj
Hd0(xoi,xoj) 1.94 0.01
Hd1(xoi,xoj) 0.0 0.0
Hd2(xoi,xoj) -1.88 0.0
Hypothesis xoi-xoj yoi-yoj
H(xoi,xoj) -3.94 -2.87

22 Details on Hypotheses and Paddle policies

The action-paddle hypotheses use the control hypotheses as defined in the paper, with a saliency function of proximal, and control behaviors learned from DP-GMM on the displacement models in the segments after paddle changepoints. The values turn out to be: -1.88,1.94,0, where the right and left command in true space are -2,2,0. For the ball, we also perform de-noising by DP-GMM, by taking the mean location when a changepoint occurs. This turns out to be -3.94,-2.87, which is approximately above and to the left of the paddle. this is because the filters only look at changepoints, and are not necessarily centered. The different approximately mirrors the ball offset relative to the paddle offset (the ball in learning is more to the left).