On Inductive Biases in Deep Reinforcement Learning

  • 2019-07-05 16:14:55
  • Matteo Hessel, Hado van Hasselt, Joseph Modayil, David Silver
  • 1

Abstract

Many deep reinforcement learning algorithms contain inductive biases thatsculpt the agent's objective and its interface to the environment. Theseinductive biases can take many forms, including domain knowledge and pretunedhyper-parameters. In general, there is a trade-off between generality andperformance when algorithms use such biases. Stronger biases can lead to fasterlearning, but weaker biases can potentially lead to more general algorithms.This trade-off is important because inductive biases are not free; substantialeffort may be required to obtain relevant domain knowledge or to tunehyper-parameters effectively. In this paper, we re-examine severaldomain-specific components that bias the objective and the environmentalinterface of common deep reinforcement learning agents. We investigated whetherthe performance deteriorates when these components are replaced with adaptivesolutions from the literature. In our experiments, performance sometimesdecreased with the adaptive components, as one might expect when comparing tocomponents crafted for the domain, but sometimes the adaptive componentsperformed better. We investigated the main benefit of having fewerdomain-specific components, by comparing the learning performance of the twosystems on a different set of continuous control problems, without additionaltuning of either system. As hypothesized, the system with adaptive componentsperformed better on many of the new tasks.

 

Quick Read (beta)

On Inductive Biases in Deep Reinforcement Learning

Matteo Hessel*    Hado van Hasselt*    Joseph Modayil    David Silver   
DeepMind
   London    UK
Abstract

Many deep reinforcement learning algorithms contain inductive biases that sculpt the agent’s objective and its interface to the environment. These inductive biases can take many forms, including domain knowledge and pretuned hyper-parameters. In general, there is a trade-off between generality and performance when algorithms use such biases. Stronger biases can lead to faster learning, but weaker biases can potentially lead to more general algorithms. This trade-off is important because inductive biases are not free; substantial effort may be required to obtain relevant domain knowledge or to tune hyper-parameters effectively. In this paper, we re-examine several domain-specific components that bias the objective and the environmental interface of common deep reinforcement learning agents. We investigated whether the performance deteriorates when these components are replaced with adaptive solutions from the literature. In our experiments, performance sometimes decreased with the adaptive components, as one might expect when comparing to components crafted for the domain, but sometimes the adaptive components performed better. We investigated the main benefit of having fewer domain-specific components, by comparing the learning performance of the two systems on a different set of continuous control problems, without additional tuning of either system. As hypothesized, the system with adaptive components performed better on many of the new tasks.

The deep reinforcement learning (RL) community has demonstrated that well-tuned deep RL algorithms can master a wide range of challenging tasks. Human-level performance has been approached or surpassed on board-games such as Go and Chess (Silver et al., 2017), video-games such as Atari (Mnih et al., 2015; Hessel et al., 2018a; Horgan et al., 2018), and custom 3D navigation tasks (Johnson et al., 2016; Kempka et al., 2016; Beattie et al., 2016; Mnih et al., 2016; Jaderberg et al., 2016; Espeholt et al., 2018). These results are a testament to the generality of the overall approach. At times, however, the excitement for the constant stream of new domains being mastered by suitable RL algorithms may have over-shadowed the dependency on inductive biases of these agents, and the amount of tuning that is often required for these to perform effectively in new domains. A clear example of the benefits of generality is the AlphaZero algorithm (Silver et al., 2017). AlphaZero differs from the earlier AlphaGo algorithm (Silver et al., 2016) by removing all dependencies on Go-specific inductive biases and human data. After removing these biases, AlphaZero did not just achieve higher performance in the game of Go, but could also learn effectively to play Chess and Shogi.

In general, there is a trade off between generality and performance when we inject inductive biases into our algorithms. Inductive biases take many forms, including domain knowledge and pretuned learning parameters. If applied carefully, such biases can lead to faster and better learning. On the other hand, fewer biases can potentially lead to more general algorithms that work out of the box on a wider class of problems. Crucially, most inductive biases are not free: for instance, substantial effort can be required to attain the relevant domain knowledge or pretune parameters. This cost is often hidden—for instance, one might use hyperparameters established as good in prior work on the same domain, without knowing how much data or time was spent optimising these, nor how specific the settings are to the given domain. Systematic studies about the impact of different inductive biases are rare and the generality of these different biases is often unclear. Another consideration is that inductive biases may mask the generality of other parts of the system as a whole; if a learning algorithm tuned for a specific domain does not generalize out of the box to a new domain, it can be unclear whether it is due to the having the wrong inductive biases or whether the underpinning learning algorithm is lacking something important.

In this paper, we consider several commonly used domain heuristics, and investigate if (and by how much) performance deteriorates when we replace these with more general adaptive components, and we assess the impact of such replacements on the generality of the agent. We consider two broad ways of injecting inductive biases in RL agents: 1) sculpting the agent’s objective (e.g., clipping and discounting rewards), 2) sculpting the agent-environment interface (e.g., repeating each selected action a hard-coded fixed number of times, or crafting of the learning agent’s observation). We first highlight, in a set of carefully constructed toy domains, the limitations of common instantiations of the these classes of inductive biases. Next, we show, in the context of the Atari Learning Environment (Bellemare et al., 2013), that the carefully crafted heuristics commonly used in this domain (and often considered essential for good performance) can be safely replaced with adaptive components, while preserving competitive performance across the benchmark. Finally, we show that this results in increased generality for an actor critic agent; the resulting fully adaptive system can be applied with no additional tuning on a separate suite of continuous control tasks. Here, the adaptive system achieved much higher performance than a comparable system using the Atari tuned heuristics, and also higher performance than an actor-critic agent that was tuned for this benchmark (Tassa et al., 2018).

1 Background

Problem setting:

Reinforcement learning is a framework for learning and decision making under uncertainty, where an agent interacts with its environment in a sequence of discrete steps, executing actions At and getting observations Ot+1 and rewards Rt+1 in return. The behaviour of an agent is specified by a policy π(At|Ht): a probability distribution over actions conditional on previous observations (the history Ht=O1:t). The agent’s objective is to maximize the rewards collected in each episode of experience under policy π, and it must learn such policy without direct supervision, by trial and error. The amount of reward collected from time t onwards, the return, is a random variable

Gt=k=0TendγkRt+k+1, (1)

where Tend is the number of steps until episode termination and γ[0,1] is a constant discount factor. An optimal policy is one that maximizes the expected returns or values: v(Ht)=Eπ[Gt|Ht]. In fully observable environments the optimal policy depends on the last observation alone: π*(At|Ht)=π*(At|Ot). Otherwise, the history may be summarized in an agent state St=f(Ht). The agent’s objective is then to jointly learn the state representation f and policy π(At|St) to maximize values. The fully observable case is formalized as a Markov Decision Process Bellman (1957).

Actor-critic algorithms:

Value-based algorithms efficiently learn to approximate values v𝐰(s)vπ(s)Eπ[Gt|St=s], under a policy π, by exploiting the recursive decomposition vπ(s)=E[Rt+1+γvπ(St+1)|St=s] known as the Bellman equation, which is used in temporal difference learning Sutton (1988) through sampling and incremental updates:

Δ𝐰t=(Rt+1+γv𝐰(St+1)-v𝐰(St))𝐰v𝐰(St). (2)

Policy-based algorithms update a parameterized policy π𝜽(At|St) directly through a stochastic gradient estimate of the direction of steepest ascent in the value Williams (1992); Sutton et al. (2000), for instance:

Δ𝜽t=Gtlogπ𝜽(At|St). (3)

Value-based and policy-based methods are combined in actor-critic algorithms. If a state value estimate is available, the policy updates can be computed from incomplete episodes by using the truncated returns Gt(n)=k=0n-1γkRt+k+1+γnv𝐰(St) that bootstrap on the value estimate at state St+n according to v𝐰. This can reduce the variance of the updates. The variance can be further reduced using state values as a baseline in policy updates, as in advantage actor-critic updates

Δ𝜽t=(Gt(n)-v𝐰(St))𝜽logπ𝜽(At|St). (4)

2 Common inductive biases and
corresponding adaptive solutions

We now describe a few commonly used heuristics within the Atari domain, together with the adaptive replacements that we investigated in our experiments.

2.1 Sculpting the agent’s objective

Many current deep RL agents do not directly optimize the true objective that they are evaluated against. Instead, they are tasked with optimizing a different handcrafted objective that incorporates biases to make learning simpler. We consider two popular ways of sculpting the agent’s objective: reward clipping, and the use of fixed discounting of future rewards by a factor different from the one used for evaluation.

In many deep RL algorithms, the magnitude of the updates scales linearly with the returns. This makes it difficult to train the same RL agent, with the same hyper-parameters, on multiple domains, because good settings for hyper-parameters such as the learning rate vary across tasks. One common solution is to clip the rewards to a fixed range (Mnih et al., 2015), for instance [-1,1]. This clipping makes the magnitude of returns and updates more comparable across domains. However, this also radically changes the agent objective, e.g., if all non-zero rewards are larger than one, then this amounts to maximizing the frequency of positive rewards rather than their cumulative sums. This can simplify the learning problem, and, when it is a good proxy for the true objective, can result in good performance. In other tasks, however, clipping can result in sub-optimal policies because the objective that is optimized is ill-aligned with the true objective.

PopArt (van Hasselt et al., 2016; Hessel et al., 2018b) was introduced as a principled solution to learn effectively irrespective of the magnitude of returns. PopArt works by tracking the mean μ and standard deviation σ of bootstrapped returns Gt(n). Temporal difference errors on value estimates can then be computed in a normalized space, with n𝐰(s) denoting the normalized value, while the unnormalized values (needed, for instance, for bootstrapping) are recovered by a linear transformation v𝐰(s)=μ+σ*n𝐰(s). Doing this naively increases the non-stationarity of learning since the unnormalized predictions for all states change every time we change the statistics. PopArt therefore combines the adaptive rescaling with an inverse transformation of the weights at the last layer of n𝐰(s), thereby preserving outputs precisely under any change in statistics μμ and σσ. This is done exactly by updating weights and biases as w=wσ/σ and b=(σb+μ-μ)/σ.

Discounting is part of the traditional MDP formulation of RL. As such, it is often considered a property of the problem rather than a tunable parameter on the agent side. Indeed, sometimes, the environment does define a natural discounting of future rewards (e.g., inflation in a financial setting). However, even in episodic settings where the agent should maximize the undiscounted return, a constant discount factor is often used to simplify learning (by having the agent focus on a relatively short time horizon). Optimizing this proxy of the true return often results in the agent achieving superior performance even in terms of the undiscounted return (Machado et al., 2017). This benefit comes with the cost of adding a hyperparameter, and a rather sensitive one: learning might be fast if the discount is small, but the solution may be too myopic.

Instead of tuning the discount manually, we use meta-learning (cf. Sutton, 1992; Bengio, 2000; Finn et al., 2017; Xu et al., 2018) to adapt the discount factor. The meta-gradient algorithm Xu et al. (2018) uses the insight that the updates in Equations (2) and (4) are differentiable functions of hyper-parameters such as the discount. On the next sample or rollout of experience, using updated parameters 𝐰+Δ𝐰(γ), written here as an explicit function of the discount, the agent then applies a gradient based actor-critic update, not to parameters 𝐰, but to the parameter 𝜽 that defines the discount γ which is used in a standard learning update. This approach was shown to improve performance on Atari Xu et al. (2018), when using a separate hand-tuned discount factor for the meta-update. We instead use the undiscounted returns (γm=1) to define the meta-gradient updates, to test whether this technique can fully replace the need for manual tuning discounts.

A related heuristic, quite specific to Atari, is to track the number of lives that the agent has available (in several Atari games the agent is allowed to die a fixed number of times before the game is over), and hard code an episode termination (γ=0) when this happens. We ignore the number of lives channel exposed by the Arcade Learning Environment in all our experiments.

2.2 Sculpting the agent-environment interface

A common assumption in reinforcement learning is that time progresses in discrete steps with a fixed duration. Although algorithms are typically defined in this native space, learning at the fastest timescale provided by the environment may not be practical or efficient, at least with the current generation of learning algorithms. It is often convenient to have the agent operate at a slower timescale, for instance by repeating each selected action a fixed number of times. The use of fixed action repetitions is a widely used heuristic (e.g., Mnih et al., 2015; van Hasselt et al., 2016; Wang et al., 2016; Mnih et al., 2016) with several advantages. 1) Operating at a slower timescale increases the action gap (Farahmand, 2011), which can lead to more stable learning (Bellemare et al., 2015) because it becomes easier to appropriately rank actions reliably when the value estimates are uncertain or noisy. 2) Selecting an action every few steps can save a significant amount of computation. 3) Committing to each action for a longer duration may help exploration, because the diameter of the solution space has effectively been reduced, for instance removing some often-irrelevant sequences of actions that jitter back and forth at a fast time scale.

A more general solution approach is for the agent to learn the most appropriate time scale at which to operate. Solving this problem in full generality is one aim of hierarchical reinforcement learning (Dayan and Hinton, 1993; Wiering and Schmidhuber, 1997; Sutton. et al., 1998; Bacon et al., 2017). This general problem remains largely unsolved. A simpler, though more limited, approach is for the agent to learn how long to commit to actions (Lakshminarayanan et al., 2017). For instance, at each step t, the agent may be allowed to select both an action At and a commitment Ct, by sampling from two separate policies, both trained with policy gradient. Committing to an action for multiple steps raises the issue of how to handle intermediate observations without missing out on the potential computational savings. Conventional deep RL agents for Atari max-pool multiple image frames into a single observation. In our setting, the agent gets one image frame as an observation after each new action selection. The agent needs to learn to trade-off the benefits of action repetition (e.g., lower variance, more directed behaviour) with its disadvantages (e.g., not being able to revise its choices during as often, and missing potentially useful intermediate observations).

Many state-of-the-art RL agents use non-linear function approximators to represent values, policies, and states. The ability to learn flexible state representations was essential to capitalize on the successes of deep learning, and to scale reinforcement learning algorithms to visually complex domains (Mnih et al., 2015). While the use of deep neural network to approximate value functions and policies is widespread, their input is often not the raw observations but the result of domain-specific heuristic transformations. In Atari, for instance, most agents rely on down-sampling the observations to an 84×84 grid (down from the original 210×160 resolution), grey scaling them, and finally concatenating them into a K-Markov representation, with K=4. We replace this specific preprocessing pipeline with a state representation learned end-to-end. We feed the RGB pixel observations at the native resolution of the Arcade Learning Environment into a convolutional network with 32 and 64 channels (in the first and second layer, respectively), both using 5×5 kernels with a stride of 5. The output is fed to a fully connected layer with 256 hidden units, and then to an LSTM recurrent network (Hochreiter and Schmidhuber, 1997) of the same size. The policy for selecting the action and its commitment is computed as logits coming from two separate linear outputs of the LSTM. The network must then integrate information over time to cope with any issues like the flickering of the screen that had motivated the standard heuristic pipeline used by deep RL agents on Atari.

Figure 1: Investigations on the robustness of an A2C agent with respect to discounting, reward scaling and action repetitions. We report the average reward per environment step, after 5000 steps of training, for each of 20 distinct seeds. Each parameter study compares different fixed configurations of a specific hyper-parameter to the corresponding adaptive solution. In all cases the performance of the adaptive solutions is competitive with that of the best tuned solution

3 Experiments

When designing algorithms it is useful to keep in mind what properties we would like the algorithm to satisfy. If the aim is to design an algorithm, or inductive bias, that is general, in addition to metrics such as asymptotic performance and data efficiency, there are additional dimensions that are useful to consider. 1) Does the algorithm require careful reasoning to select an appropriate time horizon for decision making? This is tricky without domain knowledge or tuning. 2) How robust is the algorithm to reward scaling? Rewards can have arbitrary scales, that may change by orders of magnitudes during training. 3) Can the agent use commitment (e.g. action repetitions, or options) to alleviate the difficulty of learning at the fastest time scale? 4) Does the algorithm scale effectively to large complex problems? 5) Does the algorithm generalize well to problems it was not specifically designed and tuned for?

None of these dimensions is binary, and different algorithms may satisfy each of them to a different degree, but keeping them in mind can be helpful to drive research towards more general reinforcement learning solutions. We first discuss the first three in isolation, in the context of simple toy environments, to increase the understanding about how adaptive solutions compare to the corresponding heuristics they are intended to replace. We then use the 57 Atari games in the Arcade Learning Environment (Bellemare et al., 2013) to evaluate the performance of the different methods at scale. Finally, we investigate how well the methods generalize to new domains, using 28 continuous control tasks in the DeepMind Control Suite (Tassa et al., 2018).

3.1 Motivating Examples

We used a simple tabular actor-critic agent (A2C) to investigate in a minimal setup how domain heuristics and adaptive solutions compare with respect to some of these dimensions. We report average reward per step, after 5000 environment steps, for each of 20 replicas of each agent.

First, we investigate the role of discounting for effective learning. Consider a small chain environment with T=9 states and 2 actions. The agent starts every episode in the middle of the chain. Moving left provides a -1 penalty. Moving right provides a reward of 2d/T, where d is the distance from the left end. When either end of the chain is reached, the episode ends, with an additional reward T on the far right end. Figure 1a shows a parameter study over a range of values for the discount factor. We found the best performance was between 0.5 and 0.9, where learning is quite effective, but observed decreased performance for lower or higher discounts. This shows that it can be difficult to set a suitable discount factor, and that the naive solution of just optimizing the undiscounted return may also perform poorly. Compare this to the same agent, but equipped with the adaptive meta-gradient algorithm discussed in Section 2.1 (in orange in Figure 1.a). Even initializing the discount to the value of 0.95 (which performed poorly in the parameter study), the agent learned to reduce the discount and performed in par with the best tuned fixed discount.

Next, we investigate the impact of reward scaling. We used the same domain, but keep the discount fixed to a value of 0.8 (as it was previously found to work well). We examine instead the performance of the agent when all rewards are scaled by a constant factor. Note that in the plots we report the unscaled rewards to make the results interpretable. Figure 1.b shows that the performance of the vanilla A2C agent (in blue) degraded rapidly when the scale of the rewards was significantly smaller or larger than 1. Compare this to the same agent equipped with PopArt, we observe better performance across multiple orders of magnitude for the reward scales. In this tiny problem, learning could also be achieved by tuning the learning rate for each reward scale, but that does not suffice for larger problems. Adaptive optimization algorithm such as Adam (Kingma and Ba, 2014) or RMSProp (Tieleman and Hinton, 2012) can also provide some degree of invariance, but, as we will see in Section 3.2, they are not as effective as PopArt normalization.

Finally, we investigate the role of action repeats. We consider states arranged into a simple cycle of 11 states. The agent starts in state 0, and only moves in one direction using one action, the other action does not move the agent. The reward is 0 everywhere, except if the agent selects the non-moving action in the 11-th state: in this case the agent receives a reward of 100 and the episode ends. We compare an A2C agent that learns to choose the number of action repeats (up to 10), to an agent that used a fixed number of repetitions C. Figure 1.c shows how the number of fixed action repeats used by the agent is a sensitive hyper-parameter in this domain. Compare this to the adaptive agent that learns how often to repeat actions via policy gradient (in orange in Figure 1.c). This agent quickly learned a suitable number of action repeats and thereby performed very well. This is a general problem, in many domains of interest it can be useful to combine fine-grained control in certain states, with more coarse and directed behaviour in other parts of the state space.

Figure 2: Comparison of inductive biases to RL solutions. All curves show mean episode return as a function of the number of environment steps. Each plot compares the same fully general agent to 2 alternative. a) tuned action repeats, and no action repeats. b) tuned discount factor, and no discounting. c) reward clipping, and learning from raw rewards with no rescaling of the updates. d) learning from the raw observation stream of Atari, and the standard preprocessing.

3.2 Performance on large domains

To evaluate the performance of the different methods on larger problems, we use A2C agents on many Atari games. However, differently from the previous experiments, the agent learns in parallel from multiple copies of the environment, similarly to many state-of-the-art algorithms for reinforcement learning. This configuration increases the throughput of acting and learning, and speeds up the experiments. In parallel learning training setups, the learning updates may be applied synchronously (Espeholt et al., 2018) or asynchronously (Mnih et al., 2016). Our learning updates are synchronous: the agent’s policy takes steps in parallel across 16 copies of the environment to create multi-step learning updates, batched together to compute a single update to the parameters. We train individual agents on each game. Per-game scores are averaged over 8 seeds, and we then track the median human normalized score across all games. All hyper-parameters for our A2C agents were selected for a generic A2C agent on Atari before the following experiments were performed, with details given in the appendix.

Our experiments measure the performance of a full adaptive A2C agent with learned action repeats, PopArt normalization, learned discount factors, and an LSTM-based state representation. We compare the performance of this agent to agents with exactly one adaptive component disabled and replaced with one of two fixed components. This fixed component is either falling back to the environment specified task (e.g. learning directly from undiscounted returns), or using the corresponding fixed heuristic from DQN. These comparisons enable us to investigate how important the original heuristic is for current RL algorithms, as well as how fully an adaptive solution can replace it.

In Figure 2a, we investigate action repeats and their impact on learning. We compare the fully general agent to an agent that used exactly 4 action repetitions (as tuned for Atari (Mnih et al., 2015)), and to an agent that acted and learned at the native frame rate of the environment. The adaptively learned solution performed almost as well as the tuned domain heuristic of always repeating each action 4 times. Interestingly, in the first 100M frames, also acting at the fastest rate was competitive with the agents equipped with action repetition (whether fixed or learned), at least in terms of data efficiency. However, while the agents with action repeats were still improving performance until the very end of the training budget, the agent acting at the fastest timescale appeared to plateau earlier. This performance plateau is observed in multiple games (see appendix), and we speculate that the use of multiple action repetitions may be helping achieve better exploration. We note that, in wall-clock time, the gap in the performance of the agents with action repetitions was even larger due to the additional compute.

In Figure 2b, we investigate discounting. The agent that used undiscounted returns directly in the updates to policy and values performed very poorly, demonstrating that in complex environments the naive solution of directly optimizing the real objective is problematic with modern deep RL agents. Interestingly, while performance was very poor overall, the agent did demonstrate good performance on a few specific games. For instance, in bowling it achieved a better score than state of the art agents such as Rainbow (Hessel et al., 2018a) and ApeX (Horgan et al., 2018). The agent with tuned discount and the agent with a discount factor learned through meta-gradient RL performed much better overall. The adaptive solution did slightly better than the heuristic.

In Figure 2c, we investigate the effect of reward scales. We compare the fully adaptive agent to an agent where clipping was used in place of PopArt, and to a naive agent that used the environment reward directly. Again, the naive solution performed very poorly, compared to using either the domain heuristic or the learned solution. Note that the naive solution is using RMSProp as an optimizer, in combination with gradient clipping by norm (Pascanu et al., 2012); together these techniques should provide at least some robustness to scaling issues, but in our experiments PopArt provided an additional large increase in performance. In this case, the domain heuristic (reward clipping) retained a significant edge over the adaptive solution. This suggests that reward clipping might not be helping exclusively with reward scales; the inductive bias of optimizing for a weighted frequency of rewards is a very good heuristic in many Atari games, and the qualitative behaviour resulting from optimizing the proxy objective might result in a better learning dynamics. We note, in conclusion, that while clipping was better in aggregate, PopArt yielded significantly improved scores on several games (e.g., centipede) where the clipped agent was stuck in sub-optimal policies.

Finally, in Figure 2d, we compare the fully end to end pipeline with a recurrent network, to a feedforward neural network with the standard Atari pipeline. The recurrent end to end solution performed best, showing that a recurrent network is sufficiently flexible to learn on its own to integrate relevant information over time, despite the Atari-specific features of the observation stream (such as the flickering of the screen) that motivated the more common heuristic approach.

Figure 3: In a separate experiment on the 28 tasks in the DeepMind Control Suite, we compared the general solution agent with an A2C agent using all the domain heuristics previously discussed. Both agents were trained and evaluated on the new domain with no changes to the algorithm nor any additional tuning for this very different set of environments. On average, the general adaptive solutions transfer better to the new domain that the heuristic solution. On the left we plot the average performance across all 28 tasks. On the right we show the learning curves on a selection of 10 tasks.

3.3 Generalization to new domains

Our previous analysis shows that learned solutions are mostly quite competitive with the domain heuristics on Atari, but do not uniformly provide additional benefits compared to the well tuned inductive biases that are commonly used in Atari. To investigate the generality of these different RL solutions, in this section we compare the fully general agent to an agent with all the usual inductive biases, but this time we evaluate them on a completely different benchmark: a collection of 28 continuous control tasks in the DeepMind Control Suite. The tasks represent a wide variety of physical control problems, and the dimension of the real-valued observation and action vectors differs across the tasks. The environment state can be recovered from the observation in all but one task. The rewards are bounded between 0 and 1, and tasks are undiscounted.

Again, we use a parallel A2C implementation, with 16 copies of the environment, and we aggregate results by first averaging scores across 8 seeds, and then taking the mean across all 28 tasks. Because all tasks in this benchmark are designed to have episode returns of comparable magnitude, there is no need to normalize the results to meaningfully aggregate them. For both agents we use the exact same solutions that were used in Atari, with no additional tuning. The agents naturally transfer to this new domain with two modifications: 1) we do not use convolutions since the observations do not have spatial structure. 2) the outputs of the policy head are interpreted as encoding the mean and variance of a Gaussian distribution instead of as the logits of a categorical one.

Figure 3 shows the fully general agent performed much better than the heuristic solution, which suggests that the set of inductive biases typically used by Deep RL agents on Atari do not generalize as well to this new domain as the set of adaptive solutions considered in this paper. This highlights the importance of being aware of the priors that we incorporate into our learning algorithms, and their impact on the generality of our agents. On the right side of Figure 3, we report the learning curves on the 10 tasks for which the absolute difference between the performance of the two agents was greatest (details on the full set of 28 tasks can be found in Appendix). The adaptive solutions performed equal or better than the heuristics on each of these 10 tasks, and the results in the appendix show performance was rarely worse. The reference horizontal black lines mark the performance of an A3C agent, tuned specifically for this suite of tasks, as reported by Tassa et al. (2018). The adaptive solution was also better, in aggregate, than this well tuned baseline; note however the tuned A3C agent achieved higher performance on a few games.

4 Related Work and Discussion

The present work was partially inspired by the work of Silver et al. (2017) in the context of Go. They demonstrated that specific domain specific heuristics (e.g. pretraining on human data, the use of handcrafted Go-specific features, and exploitation of certain symmetries in state space), while originally introduced to simplify learning (Silver et al., 2016), had actually outlived their usefulness: taking a purer approach, even stronger Go agents could be trained. Importantly, they showed removing these domain heuristics, the same algorithm could master other games, such as Shogi and Chess. In our paper, we adopted a similar philosophy but investigated the very different set of domain specific heuristics, that are used in more traditional deep reinforcement learning agents.

Our work relates to a broader debate (Marcus, 2018) about priors and innateness. There is evidence that we, as humans, posses specific types of biases, and that these have a role in enabling efficient learning (Spelke and Kinzler, 2007; Dubey et al., 2018); however, it is far from clear the extent to which these are essential for intelligent behaviour to arise, what form these priors take, and their impact on the generality of the resulting solutions. In this paper, we demonstrate that several heuristics we commonly use in our algorithms already harm the generality of our methods. This does not mean that other different inductive biases could not be useful as we progress towards more flexible, intelligent agents; it is however a reminder to be careful with the domain knowledge and priors we bake into our solutions, and to be prepared to revise them over time.

We found that existing learned solutions are competitive with well tuned domain heuristics, even on the domain these heuristics were designed for, and they seem to generalize better to unseen domain. This makes a case for removing these biases in future research on Atari, since they are not essential for competitive performance, and they might hide issues in the core learning algorithm. The only case where we still found a significant gap in favour of the domain heuristic was in the case of clipping. While, to the best of our knowledge, PopArt does address scaling issues effectively, clipping still seems to help on several games. Changing the reward distribution has many subtle implications for the learning dynamics, beside affecting the magnitude of updates (e.g. exploration, risk-propensity, …). We leave to future research to investigate what other general solutions could be deployed in our agents to fully recover the observed benefits of clipping.

Several of the biases that we considered have knobs that could be tuned rather than learned (e.g. the discount, the number of repeats, etc); however, this is not a satisfying solution for several reasons. Tuning is expensive, and these heuristics interact subtly with each other, thus requiring an exploration of a combinatorial space to find suitable settings. Consider the use of fixed action repeats: when changing the number of repetitions you also need to change the discount factor, otherwise this will change the effective horizon of the agent, which in turn affects the magnitude of the returns and therefore the learning rate. Also, a fixed tuned value might still not give you the full benefits of an adaptive learned approach that can adapt to the various phases of the training process.

There are other two features of our algorithm that, despite not incorporating quite as much domain knowledge as the heuristics discussed in this paper, also constitute a potential impediment to its generality and scalability. 1) The use of parallel environments is not always feasible in practice, especially in real world applications (although recent work on robot farms Levine et al. (2016) shows that it might still be a valid approach when sufficient resources are available). 2) The use of back-propagation through time for training recurrent state representations constrains the length of the temporal relationship that we can learn, since the memory consumption is linear in the length of the rollouts. Further work in overcoming these limitations, successfully learning online from a single stream of experience, is a fruitful direction for future research.

References

  • Bacon et al. (2017) P. Bacon, J. Harb, and D. Precup. The option-critic architecture. AAAI Conference on Artificial Intelligence, 2017.
  • Beattie et al. (2016) C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. Deepmind lab. CoRR, abs/1612.03801, 2016.
  • Bellemare et al. (2013) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. JAIR, 2013.
  • Bellemare et al. (2015) M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. CoRR, abs/1512.04860, 2015.
  • Bellman (1957) R. Bellman. Dynamic programming. Princeton University Press, 1957.
  • Bengio (2000) Y. Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
  • Dayan and Hinton (1993) P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Neural Information Processing Systems, 1993.
  • Dubey et al. (2018) Dubey, Agrawal, Pathak, Griffiths, and Efros. Investigating human priors for playing video games. CoRR, abs/1802.10217, 2018. URL http://arxiv.org/abs/1802.10217.
  • Espeholt et al. (2018) L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, 2018.
  • Farahmand (2011) A.-m. Farahmand. Action-gap phenomenon in reinforcement learning. In Neural Information Processing Systems, 2011.
  • Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
  • Hessel et al. (2018a) Hessel, Modayil, van Hasselt, Schaul, Ostrovski, Dabney, Horgan, Piot, Azar, and Silver. Rainbow. AAAI Conference on Artificial Intelligence, 2018a.
  • Hessel et al. (2018b) Hessel, Soyer, Espeholt, Czarnecki, Schmitt, and van Hasselt. Multi-task deep reinforcement learning with popart. AAAI Conference on Artificial Intelligence, 2018b.
  • Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Horgan et al. (2018) D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
  • Jaderberg et al. (2016) M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2016.
  • Johnson et al. (2016) M. Johnson, K. Hofmann, T. Hutton, and D. Bignell. The Malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247, 2016.
  • Kempka et al. (2016) Kempka, Wydmuch, Runc, Toczek, and Jaśkowski. Vizdoom: A doom-based ai research platform for visual rl. In Conference on Computational Intelligence and Games, 2016.
  • Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Lakshminarayanan et al. (2017) Lakshminarayanan, Sharma, and Ravindran. Dynamic action repetition for deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2017.
  • Levine et al. (2016) S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In ISER, 2016.
  • Machado et al. (2017) M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. ArXiv e-prints, Sept. 2017.
  • Marcus (2018) Marcus. Innateness, alphazero, and artificial intelligence. CoRR, abs/1801.05667, 2018.
  • Mnih et al. (2016) Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu. Asynchronous methods for deep rl. In International Conference on Machine Learning, 2016.
  • Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • Pascanu et al. (2012) R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012. URL http://arxiv.org/abs/1211.5063.
  • Silver et al. (2017) Silver, Hubert, Schrittwieser, Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, Graepel, Lillicrap, Simonyan, and Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.
  • Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
  • Spelke and Kinzler (2007) E. S. Spelke and K. D. Kinzler. Core knowledge. Developmental Science, 10:89–96, 2007.
  • Sutton (1988) Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
  • Sutton (1992) Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI Conference on Artificial Intelligence, 1992.
  • Sutton. et al. (1998) Sutton., Precup, and Singh. Intra-option learning about temporally abstract actions. In International Conference on Machine Learning, 1998.
  • Sutton et al. (2000) Sutton, McAllester, Singh, and Mansour. Policy gradient methods for rl with function approximation. In Neural Information Processing Systems, 2000.
  • Tassa et al. (2018) Tassa, Doron, Muldal, Erez, Li, L. Casas, Budden, Abdolmaleki, Merel, Lefrancq, Lillicrap, and Riedmiller. Control suite. 2018.
  • Tieleman and Hinton (2012) T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012.
  • van Hasselt et al. (2016) van Hasselt, Guez, Hessel, Mnih, and Silver. Learning values across many orders of magnitude. In Neural Information Processing Systems, 2016.
  • van Hasselt et al. (2016) H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, pages 2094–2100, 2016.
  • Wang et al. (2016) Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016.
  • Wiering and Schmidhuber (1997) M. Wiering and J. Schmidhuber. HQ-learning. Adaptive Behavior, 6(2):219–246, 1997.
  • Williams (1992) Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
  • Xu et al. (2018) Xu, van Hasselt, and Silver. Meta-gradient reinforcement learning. CoRR, abs/1805.09801, 2018.

Appendix

Appendix A Training Details

We performed very limited tuning on Atari, both due to the cost of running so many comparison with 8 seeds at scale across 57 games, and because we were interested in generalization to a different domain. We used a learning rate of 1e-3, an entropy cost of 0.01 and a baseline cost of 0.5. The learning rate was selected among 1e-3, 1e-4, 1e-5 in an early version of the agent without any of the adaptive solutions, and verified to be reasonable as we were adding more components. Similarly the entropy cost was selected between 0.1 and 0.01. The baseline cost was not tuned, but we used the value of 0.5 that is common in the literature. We used the TensorFlow default settings for the additional parameters of the RMSProp optimizer.

The learning updates were batched across rollouts of 150 agent steps for 16 parallel copies of the environment. All experiments used gradient clipping by norm, with a maximum norm of 5., as in Wang et al. (2016). The adaptive agent could choose to repeat each action up to 6 times. PopArt used a step-size of 3e-4, with bounds on the scale of 1e-4 and 1e6 for numerical stability, as reported by Hessel et al. (2018b). We always used the undiscounted return to compute the meta-gradient updates, and when using the adaptive solutions these would update both the discount γ and the trace λ, as in the experiments by Xu et al. (2018). The meta-updates were computed on smaller rollouts of 15 agent steps, with a meta-learning rate of 1e-3. No additional tuning was performed for any of the experiments on the Control Suite.

Appendix B Experiment Details

In Figure 4 we report the detailed learning curves for all Atari games for three distinct agents: the fully adaptive agent (in red), the agent with fixed action repeats (in green), and the agent acting at the fastest timescale (in blue). It’s interesting to observe how on a large number of games (ice_hockey, jamesbond, robotank, fishing, double_dunk, etc) the agent with no action repeats seems to learn stably and effectively up to a point, but then plateaus quite abruptly. This suggests that exploration might be a major reason for the overall reduced aggregate performance of this agent in the later stages of training.

In Figure 5 and 6 we show the discount factors and the eligibility traces λ as they were meta-learned over the course of training by the fully adaptive agent. We plot the soft time horizons T=(1-γ)-1 and T=(1-λ)-1. It’s interesting to observe how diverse these are for the various games. Consider for instance the discount factor γ: in games such as robotank, bank_heist, and tutankham the horizon increases up to almost 1000, or even above, while in other games, such as surround and double_dunk the discount reduces over time. Also for the eligibility trace parameter λ we observe very diverse values in different games. In Figure 7 we report the learning curves for all 28 tasks in the Control Suite (the 10 selected for the main text are those for which the difference in performance between the adaptive agent and the heuristic agent is greater).

Figure 4: Performance on all Atari games of the fully adaptive agent (in red), the agent with fixed action repeats (in green), and the agent acting at the fastest timescale (in blue).
Figure 5: For each Atari game, the associated plot shows the time horizon T=(1-γ)-1 for the discount factor γ that was adapted across the 200 million training frames.
Figure 6: For each Atari game, the associated plot shows the time horizon T=(1-λ)-1 for the eligibility trace λ that was adapted across the 200 million training frames.
Figure 7: Performance on all control suite tasks for the fully adaptive agent (in red), the agent trained with the Atari heuristics (in green), and an A3C agent (black horizontal line) as tuned specifically for the Control Suite by Tassa et al. (2018).