Striving for Simplicity in Off-policy Deep Reinforcement Learning

  • 2019-07-10 07:23:27
  • Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
  • 10

Abstract

Reflecting on the advances of off-policy deep reinforcement learning (RL)algorithms since the development of DQN in 2013, it is important to ask: arethe complexities of recent off-policy methods really necessary? In an attemptto isolate the contributions of various factors of variation in off-policy deepRL and to help design simpler algorithms, this paper investigates a set ofrelated questions: First, can effective policies be learned given only accessto logged offline experience? Second, how much of the benefits of recentdistributional RL algorithms is attributed to improvements in explorationversus exploitation behavior? Third, can simpler off-policy RL algorithmsoutperform distributional RL without learning explicit distributions overreturns? This paper uses a batch RL experimental setup on Atari 2600 games toinvestigate these questions. Unexpectedly, we find that batch RL algorithmstrained solely on logged experiences of a DQN agent are able to significantlyoutperform online DQN. Our experiments suggest that the benefits ofdistributional RL mainly stem from better exploitation. We present a simple andnovel variant of ensemble Q-learning called Random Ensemble Mixture (REM),which enforces optimal Bellman consistency on random convex combinations of theQ-heads of a multi-head Q-network. The batch REM agent trained offline on DQNdata outperforms the batch QR-DQN and online C51 algorithms.

 

Quick Read (beta)

Striving for Simplicity in
Off-policy Deep Reinforcement Learning

Rishabh Agarwal , Dale Schuurmans𝓎, Mohammad Norouzi
{rishabhagarwal, schuurmans, mnorouzi}@google.com
Google Research, Brain Team
Work done as part of the Google AI Residency Program.𝓎Also at the University of Alberta.
Abstract

Reflecting on the advances of off-policy deep reinforcement learning (RL) algorithms since the development of DQN in 2013, it is important to ask: are the complexities of recent off-policy methods really necessary? In an attempt to isolate the contributions of various factors of variation in off-policy deep RL and to help design simpler algorithms, this paper investigates a set of related questions: First, can effective policies be learned given only access to logged offline experience? Second, how much of the benefits of recent distributional RL algorithms is attributed to improvements in exploration versus exploitation behavior? Third, can simpler off-policy RL algorithms outperform distributional RL without learning explicit distributions over returns? This paper uses a batch RL experimental setup on Atari 2600 games to investigate these questions. Unexpectedly, we find that batch RL algorithms trained solely on logged experiences of a DQN agent are able to significantly outperform online DQN. Our experiments suggest that the benefits of distributional RL mainly stem from better exploitation. We present a simple and novel variant of ensemble Q-learning called Random Ensemble Mixture (REM), which enforces optimal Bellman consistency on random convex combinations of the Q-heads of a multi-head Q-network. The batch REM agent trained offline on DQN data outperforms the batch QR-DQN and online C51 algorithms.

 

Striving for Simplicity in
Off-policy Deep Reinforcement Learning


  Rishabh Agarwalthanks: Work done as part of the Google AI Residency Program.yAlso at the University of Alberta. , Dale Schuurmansy, Mohammad Norouzi {rishabhagarwal, schuurmans, mnorouzi}@google.com Google Research, Brain Team

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Deep neural networks have become a critical component of modern reinforcement learning (RL) [53]. The seminal work of Mnih et al. [40, 41] on deep Q-networks (DQN) has demonstrated that it is possible to train convolutional networks [35] using Q-learning [61] and achieve human-level performance in playing Atari 2600 games [3] directly from raw pixels. Recent progress in mastering the games of Go [50] and Poker [43, 10] and advances in robotic control [36, 44, 30] present additional supporting evidence for the enormous potential of deep RL. Nevertheless, the use of neural networks within RL algorithms introduces unique and complex challenges, especially in the context of off-policy learning. For instance, it is well known that Q-learning can be unstable or even divergent when neural networks are used to parameterize optimal action-value functions [2, 9, 58]. When discussing off-policy methods with function approximation, Sutton and Barto [53] conclude that: “The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery.”

Off-policy RL algorithms are attractive as they disentangle data collection and policy optimization. This enables re-using the experience collected by any policy to improve the value estimates under the optimal policy, as utilized by value iteration [23, 7] and Q-learning [61]. By contrast, on-policy policy gradient methods [62, 54, 29, 49, 42] require access to samples drawn from the current policy to directly estimate the gradient of the objective function w. r. t. the current parameters. On-policy methods are typically convergent and stable, but require many more interactions with the environment than off-policy techniques to achieve a similar level of performance [22]. More importantly, on-policy methods are not applicable to many practical RL problems [52, 8, 55] for which offline logged data already exists before policy learning proceeds. By contrast, off-policy techniques are capable of leveraging the vast amount of existing offline data to tackle real-world problems.

In the pursuit of developing efficient and stable off-policy deep RL algorithms, recent papers have presented various conceptual and engineering ideas (e.g., [41, 48, 59, 60, 45, 19, 1, 4, 13, 22]). In the absence of theoretical guarantees for off-policy learning with function approximation, most advances in this area are largely governed by empirical results on a popular benchmark suite of Atari 2600 games [3]. Reflecting on the advances of deep RL algorithms since the early version of DQN in 2013 [40], it is important to ask: are the complexities of recent off-policy methods really necessary? Given the empirical nature of the field, we believe it is crucial to study the relative importance of individual components of our RL algorithms and strive for finding successful RL algorithms that are as simple as possible (the principle of Occam’s razor).

In an attempt to isolate various factors of variation in off-policy deep RL and help develop simpler algorithms, this paper investigates a set of interrelated questions:

  1. 1.

    Separating the contribution of offline versus online data helps isolate an RL algorithm’s ability to exploit experience and generalize versus its ability to explore effectively. Is it possible to train successful Atari agents completely in isolation solely based on offline data?

  2. 2.

    How much of the gain of recent distributional RL algorithms such as C51 [4] and QR-DQN [13] is attributed to improvements in their exploration versus exploitation behavior? In other words, are distributional RL algorithms much better than DQN if trained on the same offline data?

  3. 3.

    Can simpler off-policy algorithms that avoid learning explicit distributions over returns capture the benefits of distributional RL? Ideally, one prefers algorithms that have better theoretical characteristics than distributional Q-learning algorithms.

(a) (b)
Figure 1: (a) Median normalized scores (11) of batch agents trained using data collected from a DQN agent across 60 Atari 2600 games, and (b) Number of Atari games where a batch agent’s online performance is greater than a fully trained online DQN agent as a function of training time.

By investigating the questions above, we make several contributions to off-policy deep RL research:

  • We propose a deceptively simple experimental setup for evaluating off-policy deep RL algorithms based on logged experiences of a DQN agent. This helps reduce the computation cost of the experiments significantly and enables comparing exploitation behavior of off-policy agents in isolation. We will release the offline data used in our experiments to help reproduce our results and enable the research community evaluate off-policy methods on common ground.

  • Unexpectedly, we find that the full experience replay of a DQN agent comprising about 50 million tuples of (observation, action, reward, next observation) is enough to learn strong Atari agents completely in isolation without any interaction with the environment during training.

  • We confirm that batch QR-DQN significantly outperforms batch DQN when using identical offline data. We also confirm that QR-DQN outperforms an ensemble-DQN baseline, which optimizes Q-heads of a multi-headed Q-network based on separate TD errors for each head.

  • We present Random Ensemble Mixture (REM), a novel and simple extension of ensemble-DQN, which enforces optimal Bellman consistency on random convex combinations of the Q-heads of a multi-headed Q-network. Surprisingly, as Figure 1 shows, the batch REM agent trained offline on DQN data outperforms the batch QR-DQN [13] and the online C51 [4] agents both in terms of the median normalized scores and the number of games superior to DQN.

2 Background

In reinforcement learning (RL), an agent interacts with an environment, which is typically expressed as a Markov decision process (MDP) (𝒮,𝒜,R,P,γ) [46], with a state space 𝒮, an action space 𝒜, a stochastic reward function R(s,a), transition dynamics P(s|s,a) and a discount factor γ[0,1). A stochastic policy π(|s) maps each state s𝒮 to a distribution over actions.

For an agent following the policy π, the action-value function denoted Qπ(s,a) is defined as the expectation of cumulative discounted future rewards, i.e.,

Qπ(s,a)=𝔼[t=0γtR(st,at)|s0=s,a0=a,stP(|st-1,at-1),atπ(|st)]. (1)

The action-value function can be recursively defined in terms of the Bellman equation [7] as

Qπ(s,a)=𝒯πQπ(s,a):=𝔼[R(s,a)+γQπ(s,a)]. (2)

Our goal is to find an optimal policy π* that attains the maximum expected return, Qπ*(s,a)Qπ(s,a) for all π,s,a. One way to characterize the optimal policy π* is via the Bellman optimality equations, which recursively define the optimal action-value function denoted Q*=Qπ* via:

Q*(s,a)=𝒯Q*(s,a):=𝔼[R(s,a)+γmaxaQ*(s,a)]. (3)

To this end, Q-learning [61] iteratively improves an approximate estimate of Q* denoted Qθ by repeatedly regressing the LHS of (3) to samples from the RHS of (3). For large and complex state spaces, approximate Q-values are obtained using a neural network as the function approximator. To further stabilize learning, a target network Qθ with frozen parameters is used for computing the learning target. The target network parameters θ are updated to the current Q-network parameters θ after a fixed number of time steps. DQN [41] parameterizes Qθ with a convolutional neural network and uses Q-learning with a target network while following an ϵ-greedy policy over Qθ for data collection. DQN minimizes the TD error loss in (4) over mini-batches of tuples of the agent’s past data, (s,a,r,s), sampled from an experience replay buffer  [37] collected during training. This approach achieves human-level play on the Atari 2600 games [3].

(θ)=𝔼s,a,r,s𝒟[lδ(Qθ(s,a)-r-γmaxaQθ(s,a))], (4)

where lδ is the Huber loss given by lδ(u)={12u2for |u|δ,δ(|u|-12δ),otherwise..

Q-learning is an off-policy algorithm [53], meaning the target can be computed without consideration of how the experience was generated. In principle, off-policy RL algorithms can learn from data collected by any behavioral policy. In batch RL [34], we assume access to a fixed offline dataset of experiences without further interactions with the environment.

Distributional RL [4] makes use of a distribution over returns, denoted Z*(s,a), instead of the scalar action-value function Q*(s,a), which is the expectation of Z*(s,a). Accordingly, similar to the scalar setting, a distributional Bellman optimality operator is defined as

𝒯Z*(s,a):=DR(s,a)+γZ*(s,argmaxa𝒜𝔼Z*(s,a)), (5)

where =D denotes distributional equality and s is distributed according to P(|s,a).

The C51 algorithm by Bellemare et al. [4] was the first distributional RL algorithm, which achieved state-of-the-art performance on the Atari 2600 games. C51 models Z*(s,a) using a categorical distribution supported on a set of fixed locations z1zK uniformly spaced over a pre-determined interval. The parameters of that distribution are the probabilities qi, associated with each location zi. Given a current value distribution Z, the C51 algorithm applies a projection step to map the target 𝒯Z onto its finite element support, followed by a Kullback-Leibler minimization step.

Dabney et al. [13] proposed to use quantile regression for distributional RL and developed QR-DQN which outperformed C51. QR-DQN minimizes the Wasserstein distance to the distributional Bellman target and approximates the random return by a uniform mixture of K Dirac distributions, i.e.,

Zθ(s,a):=1Ki=1Kδθi(s,a),andQθ(s,a)=1Ki=1Kθi(s,a) (6)

with each θi being trained to match the τ^i-quantile of the target return distribution using the Huber [24] quantile regression loss, where τ^i=2i-12K for 1iK.

3 Can Pure Off-policy Learning with no Environment Interactions Succeed?

Off-policy learning with function approximation has no known convergence guarantees [53], and there are well-known counter examples [2] where Q-learning diverges even with linear function approximation. Despite the lack of theoretical guarantees, modern “off-policy” deep RL algorithms perform quite well on commonly used benchmarks such as the Arcade Learning Environment (ALE) [3] and MuJoCo locomotion tasks [57]. Most of these algorithms fall into the category of growing batch learning [34], in which data is collected and stored into an experience replay buffer [37], which is used to train the agent before further data collection.

Fujimoto et al. [20] note that these “off-policy” algorithms tend to use near on-policy exploratory policies and present results suggesting that these algorithms are ineffective when learning truly off-policy without interacting with the environment. Previously, Zhang and Sutton [63] advocated that a large replay buffer can significantly hurt the performance of Q-learning algorithms. These results raise the question that if we had access to all of the data collected during training from a deep off-policy RL agent, can pure off-policy learning without any environment interactions succeed?

To answer this question, we train a standard off-policy agent, namely Nature DQN [41], on 60 Atari 2600 games for 200 million frames (standard protocol) and save all of the experience tuples of (observation, action, reward, next observation) encountered during training. We then use these logged experiences for training off-policy agents in an offline setting (i.e., batch RL [34]) without any new interaction with the environment during training. For most of the experiments, we train all of the offline agents for the same number of gradient updates as the online agents, but we also consider using four times as many gradient updates for offline agents given that the sample-efficiency of an offline agent does not change with the number of training epochs.

The proposed offline setting for evaluating off-policy RL algorithms is much closer to supervised learning and simpler than the typical online setting. For example, in the offline setting, we optimize a training objective over a fixed dataset as compared to the non-stationary objective over a changing experience replay buffer for an online off-policy RL algorithm. This simplicity allows us to segregate the problems of exploration and exploitation in off-policy RL and study them independently.

3.1 Experimental Setup & Results

(a) (b)
Figure 2: Normalized Performance improvement (in %), per game, of (a) batch DQN (b) batch QR-DQN trained offline using DQN over a online DQN agent.

We use DQN [41] as our baseline agent to collect the full experience datasets DQN. For each game, we repeat this dataset collection process 5 times for statistical significance and report the results averaged over 5 runs unless stated otherwise. We used the hyperparameters provided in the Dopamine [11] baselines for a standardized comparison (refer to supplementary material for more details). Note that we use the stochastic version of ALE with sticky actions [39] to ward against trajectory overfitting. We compare the following agents trained offline on the fixed dataset DQN:
 Batch DQN: Can we recover the online DQN agent’s performance in the offline setting? To answer this question, we train a DQN agent offline on the data collected from online DQN (i.e., DQN).
 Batch QR-DQN: One may ask whether it is possible to exploit the data collected from online DQN more effectively using a better off-policy algorithm than DQN. To answer this question, we train agents offline on DQN using the distributional RL algorithm QR-DQN [13].

We measure the online performance of the batch agents on a score scale normalized with respect to the random and online DQN agents. The normalized score (11) for each game is one for the better performing agent among the online DQN and the random agent and zero for the worse among the DQN and the random agent.

Results. Batch DQN achieves at least 74% of online DQN’s score improvement over a random agent on half of the games and outperforms online DQN on only 10 Atari games (Figure 2(a)) when trained for the same number of gradient steps as the online DQN agent. Quite surprisingly, batch QR-DQN respectively outperforms online DQN and batch DQN on 44 and 51 games (Figure 2(b)). We also ran batch C51 on DQN and observed that it also significantly outperforms batch DQN (Figure 7). Additionally, we performed similar experiments on data collected from an online QR-DQN agent and observe that batch QR-DQN significantly outperforms batch DQN (Figure 6). The learning curves of the online DQN agent and the batch DQN, batch QR-DQN trained using DQN on all 60 games can be visualized in Figure 9 in the appendix.

The suboptimal performance of batch DQN suggests that the DQN agent is not very effective at exploiting off-policy data. More importantly, the impressive performance of batch QR-DQN indicates that learning good policies completely offline using logged DQN data is possible on most of the Atari 2600 games. Interestingly, a significant fraction of the gains from distributional RL agents (C51, QR-DQN) can be attributed to their exploitation capacity and not exploration behavior. The diversity of states in the experience replay buffer is an important factor for performance [14] and our results indicate that 50 million experience tuples from a DQN agent are diverse enough to obtain good performance on most of the Atari 2600 games.

4 Distilling the Success of Distributional RL into Simpler Algorithms

Figure 3: Network architectures for DQN, QR-DQN and the proposed expected RL variants with QR-DQN architecture, i.e., Ensemble-DQN and REM.

The source of the gains from combining non-linear function approximation with QR-DQN and from extending RL to distributional RL remain unclear [38]. Is modelling the full distribution over expected returns crucial to achieve the gains of distributional RL? Can we distill the success of distributional RL to simpler off-policy methods that avoid learning explicit distributions over returns? Our results in the previous section confirm that the benefits of distributional RL mainly stem from better exploitation behavior. Using this new insight, we try to shed light on these questions.

In order to approximate return distributions, QR-DQN modifies the DQN architecture to output K heads for each action, where each head is trained separately to represent a specific quantile value of the return distribution, and the average of the heads is used as the Q-value (6). As shown in Figure 3, the individual heads share all of the neural network layers except the final fully connected layer. We propose two simple variants of deep Q-learning using the same architecture changes over DQN employed by QR-DQN but with objective functions for minimizing the TD error of the scalar approximation to the Q-function rather than return distributions ((7), (8)).

4.1 Ensemble-DQN

Ensemble-DQN is a simple extension of DQN which approximates the Q-values via an ensemble of Q-functions [18, 45, 1]. We implement this by modifying the DQN network to output K heads which are used to estimate K Q-functions. Each one of these Q-value heads, denoted Qθk(s,a), is trained against its own target head Qθk(s,a) similar to Bootstrapped-DQN [45]. Note that our Ensemble-DQN is different from the ensembling proposed in [1] which uses the average target Q-values for training the individual Q-functions. The Q-heads with different random initialization are optimized with complete data sharing between heads using the loss (θ) given by

(θ)=1Kk=1K𝔼s,a,r,s𝒟[lδ(Qθk(s,a)-r-γmaxaQθk(s,a))], (7)

where lδ is the Huber loss.

Bootstrapped-DQN uses the Q-value estimates to improve temporally-extended exploration, while Ensemble-DQN simply uses the mean of the K estimates as the Q-value. In the offline learning setup, we are only concerned with the exploitation ability of Ensemble-DQN.

(a) (b)
Figure 4: Normalized Performance improvement (in %) , per game, of (a) batch Ensemble-DQN (b) batch REM trained offline using DQN over a online DQN agent.

4.2 Random Ensemble Mixture (REM)

Increasing the number of models used for ensembling and their diversity [33] tends to improve performance in supervised learning. We noticed a similar trend in Ensemble-DQN experiments, i.e., ensembling a larger number of Q-value estimates tends to improve performance. This observation begs the question of whether we can use a ensemble over an exponential number of Q-estimates in a computationally efficient manner.

Inspired by Dropout [51], we propose Random Ensemble Mixture for RL, which uses multiple Q-function heads to estimate the Q-value, similar to Ensemble-DQN. The key insight in REM is that since each Q-head represents a valid estimate of the Q-values, a convex combination of the Q-heads is also a valid approximator for the Q-function. This is especially true at the fixed point, where all of the Q-heads must converge to identical Q-value estimates. Using this insight, we train a family of Q-functions over a (K-1)-simplex. Specifically, for each mini-batch, we randomly draw a categorical distribution that defines a convex combination of the Q-heads to define a Q-function. This Q-function is trained against its corresponding target to define the TD error. The loss (θ) takes the form,

(θ)=𝔼s,a,r,s𝒟[𝔼α1,,αKPΔ[lδ(kαkQθk(s,a)-r-γmaxakαkQθk(s,a))]] (8)

where PΔ represents a probability distribution over the standard (K-1)-simplex ΔK-1={αK:α1+α2++αK=1,αk0,k=1,,K}. For evaluation, we use the average of the K heads as the Q-function, i.e., Q(s,a)=kQθk(s,a)/K.

Proposition 1. Consider the assumptions: (a) The distribution PΔ has full support over the entire (K-1)-simplex. (b) Only a finite number of distinct Q-functions globally minimize the loss in (4). (c) Q* is defined in terms of the MDP induced by the data distribution D. (d) Q* lies in the family of our function approximation. Then at the global minimum of L(θ) (8):

  1. 1.

    Under assumptions (a) and (b), all the Q-heads represent identical Q-functions.

  2. 2.

    Under assumptions (a)–(d), the common convergence point is Q*.

The proof of (ii) follows from (i) and the fact that (8) is lower bounded by zero and by definition Q* attains a TD error of zero. The proof of part (i) can be found in the supplementary material. In our experiments, we use a very simple distribution PΔ without any tuning: we first draw a set of K values i. i. d. from Uniform(0, 1) and normalize them to get a valid categorical distribution, i.e., αkU(0,1) followed by αk=αk/αi.

It can be inferred from Proposition 1 that despite using an architecture with multiple heads, REM does not change the representation capacity over DQN. Also, one can view (θ) as an infinite set of Bellman optimality constraints, and each constraint provides an auxiliary objective for training [26, 5]. We hypothesize that gains from over-parametrization in REM over DQN, if any, can be attributed to its training procedure.

4.3 Experiments & Results

We train batch Ensemble-DQN and batch REM offline using the experience replay data of a DQN agent (i.e., DQN). All of the offline agents including QR-DQN, Ensemble-DQN, and REM use an identical architecture with K=200 heads and the same number of parameters. We use the same number of SGD updates as the online DQN agent for the algorithms compared in this subsection.

Batch Ensemble-DQN significantly outperforms batch DQN and achieves a higher score than online DQN on 26 games (Figure 4(a)). This indicates that a noticeable proportion of gains achieved by QR-DQN is due to training multiple heads to estimate the Q-values. Batch REM outperforms online DQN and batch Ensemble-DQN in 35 and 47 games respectively (Figure 4(b)).

4.4 Asymptotic Performance of Batch Agents

In the offline setting, since the training dataset is fixed, the sample efficiency of the batch agents remain unchanged if trained for longer duration. This leads to the question of whether there would be any significant performance gain at the expense of increased computational efficiency for training the batch agents. To answer this, we train all the batch agents on dataset collected from online DQN (DQN) for four times as many gradient steps (indicated by the suffix (4X) after the name of a batch agent) as compared to the online DQN (or C51) agent.

Figure 1 reveals that the batch agents are able to exploit the data much more effectively when trained for more gradient updates. Interestingly, batch REM (4X) achieves a higher score than online DQN as on more games as well as a higher median normalized score as compared to online C51. It is worthwhile to note that batch REM (4X) is able to outperform batch QR-DQN (4X). Also, batch REM (4X) and batch QR-DQN (4X) outperform C51 on approximately half of the games (see Figure 8 in the appendix). These results indicate that the success of distributional RL algorithms such as QR-DQN and C51 can be distilled to the simpler REM algorithm, at least in the batch setting.

Figure 5: Average returns (scores) of DQN, Ensemble-DQN, QR-DQN and REM agents trained using the batch setting on 5 Atari 2600 games using logged data from a online DQN agent (i.e., DQN). The horizontal line shows the best performance of the online DQN agent. The returns (scores) are averaged over 5 runs, smoothed over a sliding window of 3 iterations, and error bands show 95% confidence interval.

Figure 5 shows that the batch agents are able to obtain much higher scores compared to their training data (i.e.,BDQN) on some of the games. Analogous to supervised learning, all of the batch agents exhibit overfitting, i.e., after certain number of gradient updates, the TD error on the dataset DQN (not plotted here) continues to decrease while the online performance deteriorates significantly. The learning curves comparing DQN, QR-DQN, Ensemble-DQN and REM agents trained offline using DQN on all 60 games can be visualized in Figure 10 in the appendix.

5 Related work

In an attempt to develop efficient and stable off-policy deep RL algorithms, recent papers have presented various conceptual and engineering contributions. DQN [41] proposes the use of experience replay and target networks. Bootstrapped DQN [45] and NoisyNet [19] suggest using randomized value functions to select more exploratory actions. Averaged DQN [1] suggests using a running average of several target networks as the target value function. Distributional RL [4, 13, 12] proposes using a distribution to represent optimal action-values rather than a point estimate. Rainbow [22] suggests combining the improvement of different techniques such as distributional RL, prioritized experience replay [48], double Q-learning [59, 21], dueling architecture [60], etc., into a single off-policy agent. We are investigating whether the complexity of some of these prior approaches is necessary. Our work is similar in spirit to Rajeswaran et al. [47], which attempts to demystify the complexity of on-policy deep RL for continuous control. Our work is mainly related to two subareas of RL known as Batch RL and Distributional RL.

Batch Reinforcement Learning. RL on a fixed dataset without any online interactions with the environment has been long studied [16, 31, 52, 34, 55]. While there has been increasing interest in batch off-policy RL over the last few years [28, 56, 17, 25, 27], much of this focussed on off-policy policy evaluation, where the goal is to estimate the performance of a given target policy. Our work tackles the problem of batch off-policy optimization which requires learning a good policy given a fixed dataset. Recent work from Fujimoto et al. [20] as well as concurrent work from Kumar et al. [32] document that standard off-policy RL methods with static datasets fail on continuous control tasks. In [20], this failure is attributed to extrapolation error, i.e., the mismatch between the batch dataset and true state-action visitation of the current policy. We assert that the failure on off-policy expert data reported in [32] is analogous to the difficulty of learning a binary classifier using only positive training examples. Additionally, our positive results with standard off-policy agents suggest that failure to learn from diverse offline data is not necessarily an indication of issues with batch RL (e.g., see [20]), but it maybe linked to extrapolation error for weaker exploitation agents, e.g., DQN.

Distributional Reinforcement Learning. Recently, distributional RL algorithms [4, 13, 12], especially C51 [4], gained enormous popularity due to their impressive performance on the ALE benchmark [3]. Dabney et al. [13] proposed QR-DQN allowing for a more flexible parameterization for modeling the return distribution, leading it to outperform C51. Despite these algorithmic advances, our understanding of performance gains due to practical distributional RL methods remains limited [38, 6]. In this work, we investigate the QR-DQN algorithm, as it only modifies the architecture (Figure 3) and the regression loss objective used in DQN, and therefore, easier to study than C51. We demonstrate that the gains from QR-DQN mainly stem from its exploitation ability and distill the success of QR-DQN to REM (Section 4.2), a much simpler off-policy expected RL algorithm.

6 Conclusion

This paper investigates off-policy deep RL for learning to play the Atari 2600 games from offline replay data. We conclude that it is possible to learn successful polices that significantly outperform the policy used to collect replay data. We demonstrate the superior exploitation capability of distributional RL algorithms (e.g., batch QR-DQN) over batch DQN and batch Ensemble-DQN. We propose REM, a simple and effective variant of Ensemble-DQN that outperforms batch QR-DQN and online C51 when trained only using the replay data of DQN. The proposed offline experimental setup can serve as the testbed for evaluating off-policy algorithms and enable the research community to evaluate off-policy methods on common grounds.

For future work, analyzing the effect of the distribution over the simplex in REM and potentially learning the distribution using an adversarial objective are promising directions. We believe that REM holds significant potential in the online setting and combining its exploitation ability with ensemble exploration strategies can potentially lead to a substantial improvement. That said, this paper focuses on the exploitation ability of REM, and we leave online REM to future work.

In conclusion, our results present an optimistic view that simple RL algorithms can be developed which can effectively learn from large-scale off-policy datasets, enabling rapid progress similar to the one caused by datasets such as ImageNet [15] in supervised learning.

7 Acknowledgments

We thank Pablo Samuel Castro for help in understanding and debugging certain issues with the Dopamine codebase and reviewing an early draft of the paper. We thank Robert Dadashi, Carles Gelada and Liam Fedus for helpful discussions. We also acknowledge William Chan and Aviral Kumar for their review of initial draft of the paper.

References

  • Anschel et al. [2017] Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. ICML, 2017.
  • Baird [1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. Machine Learning, 1995.
  • Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.
  • Bellemare et al. [2017] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. ICML, 2017.
  • Bellemare et al. [2019a] Marc G Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A geometric perspective on optimal representations for reinforcement learning. arXiv preprint arXiv:1901.11530, 2019a.
  • Bellemare et al. [2019b] Marc G Bellemare, Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra. Distributional reinforcement learning with linear function approximation. arXiv preprint arXiv:1902.03149, 2019b.
  • Bellman [1966] Richard Bellman. Dynamic programming. Science, 1966.
  • Bottou et al. [2013] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 2013.
  • Boyan and Moore [1995] Justin A Boyan and Andrew W Moore. Generalization in reinforcement learning: Safely approximating the value function. NIPS, 1995.
  • Brown and Sandholm [2017] Noam Brown and Tuomas Sandholm. Libratus: The superhuman ai for no-limit poker. IJCAI, 2017.
  • Castro et al. [2018] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. ArXiv:1812.06110, 2018.
  • Dabney et al. [2018a] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. ICML, 2018a.
  • Dabney et al. [2018b] Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. AAAI, 2018b.
  • De Bruin et al. [2015] Tim De Bruin, Jens Kober, Karl Tuyls, and Robert Babuška. The importance of experience replay database composition in deep reinforcement learning. Deep reinforcement learning workshop, NIPS, 2015.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
  • Ernst et al. [2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. JMLR, 2005.
  • Farajtabar et al. [2018] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. ICML, 2018.
  • Faußer and Schwenker [2015] Stefan Faußer and Friedhelm Schwenker. Neural network ensembles in reinforcement learning. Neural Processing Letters, 2015.
  • Fortunato et al. [2018] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. ICLR, 2018.
  • Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. ICML, 2019.
  • Hasselt [2010] Hado V Hasselt. Double q-learning. NIPS, 2010.
  • Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. AAAI, 2018.
  • Howard [1960] Ronald A Howard. Dynamic programming and markov processes. MIT press, 1960.
  • Huber [1964] PJ Huber. Robust estimation of a location parameter. Ann. Math. Stat., 1964.
  • Irpan et al. [2019] Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, and Sergey Levine. Off-policy evaluation via off-policy classification. arXiv preprint arXiv:1906.01624, 2019.
  • Jaderberg et al. [2017] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2017.
  • Jaques et al. [2019] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • Jiang and Li [2016] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. ICML, 2016.
  • Kakade [2002] Sham M Kakade. A natural policy gradient. NIPS, 2002.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. CoRL, 2018.
  • Kalyanakrishnan and Stone [2007] Shivaram Kalyanakrishnan and Peter Stone. Batch reinforcement learning in a complex domain. AAMAS, 2007.
  • Kumar et al. [2019] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
  • Kuncheva and Whitaker [2003] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning, 2003.
  • Lange et al. [2012] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. Reinforcement learning, 2012.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Levine et al. [2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. JMLR, 2016.
  • Lin [1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 1992.
  • Lyle et al. [2019] Clare Lyle, Pablo Samuel Castro, and Marc G Bellemare. A comparative analysis of expected and distributional reinforcement learning. AAAI, 2019.
  • Machado et al. [2018] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 2018.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv:1312.5602, 2013.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML, 2016.
  • Moravčík et al. [2017] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 2017.
  • OpenAI et al. [2018] OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafał Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. CoRR, 2018.
  • Osband et al. [2016] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. NIPS, 2016.
  • Puterman [1994] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
  • Rajeswaran et al. [2017] Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. NIPS, 2017.
  • Schaul et al. [2016] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. ICLR, 2016.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. ICML, 2015.
  • Silver et al. [2016] David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
  • Strehl et al. [2010] Alexander L. Strehl, John Langford, Lihong Li, and Sham Kakade. Learning from logged implicit exploration data. NIPS, 2010.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Sutton et al. [2000] Richard S Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS, 2000.
  • Swaminathan and Joachims [2015] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2015.
  • Thomas and Brunskill [2016] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. ICML, 2016.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. IROS, 2012.
  • Tsitsiklis and Van Roy [1997] John N Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approximation. NIPS, 1997.
  • Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. AAAI, 2016.
  • Wang et al. [2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. ICLR, 2016.
  • Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 1992.
  • Williams [1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
  • Zhang and Sutton [2017] Shangtong Zhang and Richard S. Sutton. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.

8 Supplement

8.1 Proofs

Proposition 1. Consider the assumptions: (a) The distribution PΔ has full support over the entire (K-1)-simplex. (b) Only a finite number of distinct Q-functions globally minimize the loss in (4). (c) Q* is defined in terms of the MDP induced by the data distribution D. (d) Q* lies in the family of our function approximation. Then at the global minimum of L(θ) (8):

  1. 1.

    Under assumptions (a) and (b), all the Q-heads represent identical Q-functions.

  2. 2.

    Under assumptions (a)–(d), the common convergence point is Q*.

Proof. Part (i): Under assumptions (a) and (b), we would prove by contradiction that each Q-head should be identical to minimize the REM loss (θ) (8). Note that we consider two Q-functions to be distinct only if they differ on any state s in 𝒟.

The REM loss (θ)=𝔼αPΔ[(α,θ)] where (α,θ) is given by

(α,θ)=𝔼s,a,r,s𝒟[lδ(kαkQθk(s,a)-r-γmaxakαkQθk(s,a))]. (9)

If the heads Qθi and Qθj don’t converge to identical Q-values at the global minimum of (θ), it can be deduced using Lemma 1 that all the Q-functions given by the convex combination αiQθi+αjQθj such that αi+αj=1 minimizes the loss in (4). This contradicts the assumption that only a finite number of distinct Q-functions globally minimize the loss in (4). Hence, all Q-heads represent an identical Q-function at the global minimum of (θ).

Lemma 1. Assuming that the distribution PΔ has full support over the entire (K-1)-simplex ΔK-1, then at any global minimum of L(θ), the Q-function heads Qθk for k=1,,K minimize L(α,θ) for any αΔK-1.

Proof. Let Qα*,θ*=k=1Kαk*Qθ*k(s,a) corresponding to the convex combination α*=(α1*,,αK*) represents one of the global minima of (α,θ) (9) i.e., (α*,θ*)=minα,θ(α,θ) where αΔK-1. Any global minima of (θ) attains a value of (α*,θ*) or higher since,

(θ)=𝔼αPΔ[(α,θ)]𝔼αPΔ[(α*,θ*)](α*,θ*) (10)

Let Qθ*k(s,a)=wθ*kfθ*(s,a) where fθ*(s,a)D represent the shared features among the Q-heads and wθ*kD represent the weight vector in the final layer corresponding to the k-th head. Note that Qα*,θ* can also be represented by each of the individual Q-heads using a weight vector given by convex combination α* of weight vectors (wθ*1,,wθ*K), i.e., Q(s,a)=(k=1Kαk*wθ*k)fθ*(s,a).

Let θI be such that QθIk=Qα*,θ* for all Q-heads. By definition of Qα*,θ*, for all αPΔ, (α,θI)=(α*,θ*) which implies that (θI)=(α*,θ*). Hence, θI corresponds to one of the global minima of (θ) and any global minima of (θ) attains a value of (α*,θ*).

Since (α,θ)(α*,θ*) for any αΔK-1, for any θM such that (θM)=(α*,θ*) implies that (α,θM)=(α*,θ*) for any αPΔ. Therefore, at any global minimum of (θ), the Q-function heads Qθk for k=1,,K minimize (α,θ) for any αΔK-1.

8.2 Score Normalization

The improvement in normalized performance of a batch agent, expressed as a percentage, over a DQN agent is calculated as:

100×ScoreAgent-max(ScoreDQN,ScoreRandom)max(ScoreDQN,ScoreRandom)-min(ScoreDQN,ScoreRandom). (11)

We chose not to measure performance in terms of percentage of DQN performance alone because a tiny difference relative to the random agent on some games can translate into hundreds of percent in DQN performance difference. Additionally, the max is needed since DQN performs worse than a random agent on the games Solaris and Skiing.

Since a C51 agent always outperform the Random agent, the normalized performance over C51 (in %) simplifies to:

100×ScoreAgent-ScoreC51ScoreC51-ScoreRandom. (12)

8.3 Batch Experiments on QR-DQN data

(a) (b)
Figure 6: (a) Median normalized scores of batch agents trained using data collected from a QR-DQN agent, i.e., QR-DQN across 60 Atari games, and (b) Number of Atari games where the evaluation performance of a batch agent trained using QR-DQN is greater than a fully trained online DQN agent as a function of training time.

Similar to our experiments using DQN data, we collected data QR-DQN comprising 200 million frames from an online QR-DQN agent. The learning curves for different batch agents trained using QR-DQN are shown in Figure 11.

8.4 Batch C51 vs Batch DQN

Figure 7: Returns (Scores) for Batch DQN and Batch C51 trained using DQN on 6 Atari games averaged over 5 random seeds. Scores are moving averages over the past 3 million frames.

Figure 7 compares the performance of C51 and DQN in the offline setting using the data collected from an online DQN agent on 6 Atari games. It’s clearly visible from Figure 7 that C51 is much better at exploitation the DQN.

8.5 Hyperparameters & Experiment Details

Our results compare the agents with the same hyperparameters: target network update frequency, frequency at which exploratory actions are selected (ϵ), the length of the schedule over which ϵ is annealed, and the number of agent steps before training occurs. As mentioned in Dopamine’s [11] GitHub repository, changing these parameters can significantly affect performance, without necessarily being indicative of an algorithmic difference. Note that the batch agents trained offline do not collect any new data during training and the training ϵ and ϵ-decay schedule are irrelevant to these agents.

The performance of the agents is compared using the best evaluation score achieved during training where the evaluation is done online using a ϵ-greedy policy with ϵ=0.001. Step size and optimizer were taken as published. Note the multi-headed Q-networks (Figure 3) use similar parameters to the QR-DQN agent in the Dopamine baselines. Table 1 summarizes our choices. All numbers are in ALE frames. For all experiments, we report results averaged over 5 runs with same hyperparameters and different random seeds. Each individual run used a single Tesla P100 GPU. All the online agents were trained for 200 iterations where each iteration corresponds to experiencing 1 million ALE frames when run online.

Table 1: Hyperparameter settings used for our experiments.
Sticky actions Yes
Episode termination Game Over
Training ϵ 0.01
Evaluation ϵ 0.001
ϵ decay schedule (frames) 1,000,000
Min. history to learn (frames) 80,000
Target net. update freq. (frames) 32,000

8.6 Additional Plots & Tables

Figure 8 compares the batch REM (4X) and batch QR-DQN (4X) agents trained using DQN data against the online C51 agent. Table 2 summarizes the results related to the asymptotic performance of batch agents presented in Figure 1.

(a) (b)
Figure 8: Normalized Performance improvement (in %), per game, of (a) batch QR-DQN (b) batch REM trained offline using DQN over a online C51 agent.
Table 2: Median normalized scores (11) across 60 Atari 2600 games, measured as percentages of online DQN scores and number of games where an agent achieves better scores than online DQN. All the batch agents below are trained using BDQN.
Median >DQN
Batch DQN 74% 10
Batch Ensemble-DQN 92% 26
Batch REM 103% 35
Batch QR-DQN 115% 44
Batch DQN (4X) 83% 17
Batch Ensemble-DQN (4X) 111% 38
Batch REM (4X) 123% 48
Batch QR-DQN (4X) 119% 45
Online C51 119% 44
Online QR-DQN 135% 53

8.7 Learning Curves

All the learning curves in this section show the evaluation performance across 60 Atari games averaged over 5 runs, smoothed over a sliding window of 3 iterations, and error bands show 95% confidence interval. For each game, each individual run for a batch agent use the logged dataset collected from a DQN (or QR-DQN) agent run using a different random seed.

Figure 9: Average returns (scores) for batch DQN, batch QR-DQN, and online DQN on 60 Atari games. All the batch agents are trained for the same amount of gradient steps as the online DQN agent.
Figure 10: Average returns (scores) of DQN, Ensemble-DQN, QR-DQN and REM agents trained using the batch setting on 60 Atari games using logged data from a online DQN agent (i.e., DQN). The horizontal line for online agents show the best evaluation performance they obtain during training.
Figure 11: Average returns (scores) of DQN, Ensemble-DQN, QR-DQN and REM agents trained using the batch setting on 60 Atari games using logged data from a online QR-DQN agent (i.e., QR-DQN). The horizontal line for online agents show the best evaluation performance they obtain during training.