Abstract
Deep networks have enabled reinforcement learning to scale to more complexand challenging domains, but these methods typically require large quantitiesof training data. An alternative is to use sample-efficient episodic controlmethods: neuro-inspired algorithms which use non-/semi-parametric models thatpredict values based on storing and retrieving previously experiencedtransitions. One way to further improve the sample efficiency of theseapproaches is to use more principled exploration strategies. In this work, wetherefore propose maximum entropy mellowmax episodic control (MEMEC), whichsamples actions according to a Boltzmann policy with a state-dependenttemperature. We demonstrate that MEMEC outperforms other uncertainty- andsoftmax-based exploration methods on classic reinforcement learningenvironments and Atari games, achieving both more rapid learning and higherfinal rewards.
Quick Read (beta)
Sample-Efficient Reinforcement Learning with Maximum Entropy Mellowmax Episodic Control
Abstract
Deep networks have enabled reinforcement learning to scale to more complex and challenging domains, but these methods typically require large quantities of training data. An alternative is to use sample-efficient episodic control methods: neuro-inspired algorithms which use non-/semi-parametric models that predict values based on storing and retrieving previously experienced transitions. One way to further improve the sample efficiency of these approaches is to use more principled exploration strategies. In this work, we therefore propose maximum entropy mellowmax episodic control (MEMEC), which samples actions according to a Boltzmann policy with a state-dependent temperature. We demonstrate that MEMEC outperforms other uncertainty- and softmax-based exploration methods on classic reinforcement learning environments and Atari games, achieving both more rapid learning and higher final rewards.
Sample-Efficient Reinforcement Learning with Maximum Entropy Mellowmax Episodic Control
Marta Sarrico Department of Bioengineering Imperial College London [email protected] Kai Arulkumaran Department of Bioengineering Imperial College London [email protected] Andrea Agostinelli Department of Bioengineering Imperial College London [email protected] Pierre Richemond Data Science Institute Imperial College London [email protected] Anil A. Bharath Department of Bioengineering Imperial College London [email protected]
noticebox[b]Workshop on Biological and Artificial Reinforcement Learning, NeurIPS 2019. \[email protected]
1 Introduction
Despite the successes of deep reinforcement learning (DRL) agents Arulkumaran et al. [2017], these models have a sample-efficiency limitation: DRL agents typically require hundreds of times more experience than a human to reach similar levels, suggesting a large gap between current DRL algorithms and the operation of the human brain Yu [2018], Lake et al. [2017]. Recently, new neuro-inspired episodic control (EC) algorithms have demonstrated rapid learning, as compared to state-of-the-art DRL methods Blundell et al. [2016], Pritzel et al. [2017]. These algorithms were inspired by human long-term memory, which can be divided into semantic and episodic memory: the former is responsible for storing general knowledge and facts about the world, whilst the latter is related to recollecting our personal experiences Greenberg and Verfaellie [2010]. EC, introduced by Lengyel & Dayan Lengyel and Dayan [2008], is inspired by this biological episodic memory, and models one of the several different control systems used for behavioural decisions as suggested by neuroscience research Daw et al. [2005]. As opposed to other RL systems, EC enables rapidly learning a policy from sparse amounts of experience.
An alternative factor in the sample-efficiency of RL methods is the existence of an effective exploration policy that is able to collect diverse experiences from the environment to learn from. The deep $Q$-network (DQN) Mnih et al. [2015], a notable DRL algorithm, as well as current EC algorithms Blundell et al. [2016], Pritzel et al. [2017], use the naive $\u03f5$-greedy exploration policy Sutton and Barto [2018]. More principled exploration methods, such as upper confidence bound (UCB) Auer et al. [2002], Chen et al. [2017] and Thompson sampling (TS) Osband et al. [2013], sample actions based on uncertainty over their consequences. Another alternative is to sample from a distribution over actions, such as the Boltzmann distribution Sutton and Barto [2018], performing importance weighting of actions proportionally to the Gibbs-Boltzmann weights of their state-action values. Yet another solution is to use an alternative objective, yielding a policy that balances between maximising both the expected return and its entropy over states Asadi and Littman [2017], Kim et al. [2019]. Similarly to supervised learning settings, the latter approach corresponds to an entropic regularisation over the solution Fox et al. [2016], Nachum et al. [2017], Neu et al. [2017], Richemond and Maginnis [2017].
In this work, we hence propose maximum entropy mellowmax episodic control (MEMEC), which combines EC models with the maximum entropy mellowmax policy Asadi and Littman [2017] for principled exploration. Without resorting to maximum entropy RL, hence decoupling exploration benefits from the objective, we show that this softmax-based exploration strategy can still improve both the sample-efficiency and final returns of EC methods, while retaining their simple, low-bias Monte Carlo returns. We test MEMEC on a wide variety of domains, where it shows improvements over the original EC methods, as well as the same EC methods with alternative exploration policies.
2 Background
Reinforcement Learning: The RL setting is formalised by Markov decision processes (MDPs). MDPs are characterised by a tuple $\u27e8S,A,R,T,\gamma \u27e9$, where $S$ is the set of states, $A$ is the set of actions, $R$ is a reward function which is the immediate, intrinsic desirability of a certain state, $T$ is the transition dynamics, and $\gamma \in [0,1)$ is a discount factor that trades off the value of current and future rewards. The goal of RL is to find the optimal policy, ${\pi}^{*}$, that maximizes the expected cumulative discounted return when followed from any state $s\in S$.
$Q$-learning Watkins and Dayan [1992], a widely used temporal difference (TD) method, can learn value functions by bootstrapping. The DQN uses $Q$-learning, updating a neural-network-based state-action value function, ${Q}^{\pi}(s,a;\theta )$, parameterised by $\theta $. As a baseline across all environments we used a strong DRL method—the dueling double DQN (D3QN) Mnih et al. [2015], Van Hasselt et al. [2016], Wang et al. [2016] (further detailed in Section 7).
Episodic Control: EC models use a memory buffer that stores state-action pairs and their associated episodic returns, $(s,a,R)$, with control performed by replaying the most rewarding actions based on the similarity between the current and stored states. Implementations of these EC methods include the non-parametric model-free EC (MFEC) Blundell et al. [2016], and the semi-parametric neural EC (NEC) Pritzel et al. [2017]. Further details on MFEC and NEC can be found in Figure 1 and Section 7.
3 Exploration Strategies
For both MFEC and NEC, in order to trade off exploration and exploitation in the environment, the agent follows an $\u03f5$-greedy policy. This consists of a sampling strategy where the agent either uniformly picks over all actions with probability $\u03f5$, or instead chooses the action that maximises the predicted return.
Uncertainty-based: UCB and TS are two exploration strategies that sample actions based on predicted uncertainties over the value of the return, and hence require learning a distribution over each $Q$-value. Common practice is to assume a (sub-)Gaussian distribution over $Q$-values, which is both convenient and yields tight regret lower bounds Lai and Robbins [1985].
In UCB, the action that is chosen at each timestep is the one that is the maximum over the sum of the value (estimated mean of the distribution) and its uncertainty $U(s,a)$ (a function of the estimated standard deviation of the distribution): ${\mathrm{arg}\mathrm{max}}_{a\in \mathcal{A}}Q(s,a)+U(s,a)$. TS can be implemented by sampling from a posterior over $Q$-functions: $\stackrel{~}{Q}(s,a)\sim \mathcal{N}(Q(s,a),\stackrel{~}{\sigma}(Q(s,a))\mathit{\hspace{1em}}\forall a\in \mathcal{A}$, where $\stackrel{~}{\sigma}(Q(s,a))$ is the estimated standard deviation of the $Q$-value, and we are assuming independence of state-action value functions. Then, the $\mathrm{arg}\mathrm{max}$ action is chosen greedily from the sampled $Q$-function: ${\mathrm{arg}\mathrm{max}}_{a\in \mathcal{A}}\stackrel{~}{Q}(s,a)$. The difficulty with these methods is that they require learning a probabilistic $Q$-function, and learning well-calibrated uncertainties using neural networks is still a challenging problem Osband [2016]. Furthermore, the actions are sampled independently per timestep, and so the resulting policy can be inconsistent over time Osband et al. [2016].
Softmax-based: An alternative which does not have these disadvantages is to use a softmax-based policy, which only requires learning the expected $Q$-values. By applying the Boltzmann operator (with inverse temperature $\beta $) to the $Q$-values, ${\mathrm{boltz}}_{\beta}(Q)=\frac{{\sum}_{i=1}^{n}{Q}_{i}{e}^{\beta {Q}_{i}}}{{\sum}_{i=1}^{n}{e}^{\beta {Q}_{i}}}$, one can sample actions from the resulting Boltzmann distribution over actions, where more promising actions will be sampled more frequently. However, the Boltzmann operator is not a non-expansion (it is not Lipschitz with constant $$) for all values of the temperature parameter Littman and Szepesvári [1996]; critical to the proof of convergence in TD learning, the non-expansion property is a sufficient condition to guarantee convergence to a unique fixed point. As a solution, Asade & Littman Asadi and Littman [2017] introduced the mellowmax operator (with hyperparameter $\omega $), ${\mathrm{mm}}_{\omega}(Q)=\frac{\mathrm{log}\left(\frac{1}{n}{\sum}_{i=1}^{n}{e}^{\omega {Q}_{i}}\right)}{\omega}$, as an alternative softmax function for value function optimisation, which is a non-expansion in the infinity norm for all temperatures.
As the mellowmax operator can return the maximum (as $\omega \to \mathrm{\infty}$) or minimum (as $\omega \to -\mathrm{\infty}$) of a set of values, Asadi & Littman Asadi and Littman [2017] also proposed optimising $\omega $ under a maximum entropy constraint. This results in the maximum entropy mellowmax policy: ${\pi}_{mm}(a|s)=\frac{{e}^{\beta Q(s,a)}}{{\sum}_{a\in \mathcal{A}}{e}^{\beta Q(s,a)}}$, which is of the same functional form as the Boltzmann policy, but where the optimal $\beta $ can be solved for using a root-finding algorithm, such as Brent’s method Brent [2013]. The equation which is solved for is ${\sum}_{a\in \mathcal{A}}{e}^{\beta \left(Q(s,a)-{\mathrm{mm}}_{\omega}(Q(s,\cdot ))\right)}\left(Q(s,a)-{\mathrm{mm}}_{\omega}(Q(s,\cdot ))\right)=0.$
MEMEC, which uses the maximum entropy mellowmax policy for exploration, hence benefits from softmax-based exploration, where actions are chosen as a function of their estimated value, with a state-dependent temperature on the Boltzmann distribution over actions. As we show in our experiments, this approach outperforms both the original EC methods with $\u03f5$-greedy exploration, as well as the alternative exploration methods outlined here. We do not train on a maximum entropy objective, thereby purely using the maximum entropy mellowmax policy for exploration.
4 Experiments
Environments: We evaluated EC methods with different exploration strategies, as well as a D3QN baseline, in three sets of domains. Firstly, CartPole and Acrobot, two classic RL problems, implemented in OpenAI Gym Brockman et al. [2016]. Secondly, OpenRoom, and FourRoom, gridworld domains re-implemented from Machado et al. Machado et al. [2017]. For these domains we trained the agents for 100,000 steps, and evaluated them every 500 steps. Thirdly, Pong, Space Invaders, Q*bert, Bowling, and Ms. Pac-Man, video games with high-dimensional visual observations from the Atari Learning Environment Bellemare et al. [2013]. Due to computational resource limitations, Atari training was performed for 5,000,000 steps (20,000,000 frames) for MFEC and 3,500,000 steps (12,000,000 frames) for NEC; evaluation was performed every 100,000 steps. For Atari games, we use standard preprocessing for the state Mnih et al. [2015], and similarly repeat actions for 4 game frames (= 1 agent step). We used Gaussian random projections for MFEC, as this obtained better results than the variational autoencoder embeddings in the original work Blundell et al. [2016]. For all experiments we report the mean and standard deviation of each method, calculated over three random seeds.
Hyperparameters: For all EC methods, we primarily used the original hyperparameters Blundell et al. [2016], Pritzel et al. [2017], but chose to keep some consistent across MFEC and NEC for consistency. These were to use the more robust inverse distance weighted kernel (Equation 5) for MFEC and NEC, $k$ = 11 nearest neighbours, and a discount factor $\gamma =$ 0.99 across all domains. For the episodic memories, the buffer size was set to 150 for the room domains, 10,000 for the classic control domains and 100,000 (due to computational resource limitations) for the Atari games. We used the original hyperparameters for D3QN on Atari games Wang et al. [2016], and tuned them manually for the other domains. The full details are present in Section 11.
$\omega $ Hyperparameter: The $\omega $ hyperparameter in the mellowmax operator requires tuning per domain Asadi and Littman [2017], Kim et al. [2019]. With high $\omega $, $m{m}_{\omega}$ acts as a max operator and with $\omega $ approaching zero $m{m}_{\omega}$ acts as a mean operator. Thus, $\omega $ should not be too large, as the agent will act greedily, nor too small, as the agent will behave randomly. To tune this hyperparameter we implemented a grid search with the following $\omega $ values: for the Gridworld and classic Control environments, $\omega \in \{5,7,9,12\}$; for Atari domain, $\omega \in \{10,20,25,30,40,50,60\}$, depending on the game. As in prior work Asadi and Littman [2017], Kim et al. [2019], we show results for MEMEC with the best $\omega $. We set $\omega $ to 7.5 for all non-Atari domains, 25 for Pong, 40 for Q*Bert and Space Invaders, 50 for Bowling and 60 for Ms. Pac-Man.
Uncertainty-based Exploration: For the five simpler domains, we tested $\u03f5$-greedy, UCB, TS, Boltzmann and maximum entropy mellowmax exploration strategies. While UCB sometimes performed well, TS performed extremely badly. Upon closer examination, the uncertainty estimates over the $Q$-values as calculated via the covariance matrix defined over the inverse distance kernel Pritzel et al. [2017] embeddings over the nearest neighbour keys were both relatively constant, meaning that UCB was close to $\u03f5$-greedy with small $\u03f5$, and relatively large, causing TS’ poor performance (high variance samples for the $Q$-values). We were unable to improve the uncertainty estimates by fitting a Gaussian process to the nearest neighbours and optimising over $\delta $ in the inverse distance weighted kernel. Theoretically, UCB and TS often have similar frequentist regret bounds Lattimore and Szepesvári [2018], but our experiments demonstrate how dependent these are on well-calibrated uncertainties.^{1}^{1} 1 Despite this, we have retained the results for UCB in any experiments conducted with this method. Due to computational limitations, and as a result of these experiments on the simpler domains, we only evaluated $\u03f5$-greedy and mellowmax, as well as the D3QN baseline, on the Atari games.
5 Results
Classic Control: All five methods solved Cartpole (see Section 8), as it is a relatively simple environment where even purely random exploration can reach highly rewarding states. In Figure 2 we show the methods’ performance in Acrobot, using both MFEC and NEC. In both cases, MEMEC outperformed the other exploration strategies, not only getting to higher final rewards, but starting to learn faster. The next best strategy—the Boltzmann policy—had high variance over random seeds. As discussed previously in Section 4, UCB performed poorly, which we believe is due to poorly calibrated uncertainties. The D3QN baseline learned more slowly, and converged to a suboptimal policy.
Gridworld: MFEC performed best with softmax-based exploration strategies, whilst nearly all methods tended to perform poorly with NEC (see Section 9). Whilst MFEC with $\u03f5$-greedy was able to solve OpenRoom, it had high variance for FourRoom. In contrast to most of our other results, NEC performed best with $\u03f5$-greedy, but still had poor performance on FourRoom. D3QN was not able to consistently solve OpenRoom, and failed to solve FourRoom.
Environments | MFEC | D3QN | NEC | D3QN | ||
---|---|---|---|---|---|---|
$\u03f5$-greedy | Mellowmax | $\u03f5$-greedy | Mellowmax | |||
Bowling | $62\pm .8$ | $\mathrm{\U0001d7d4\U0001d7d6}\mathbf{\pm}\mathbf{.7}$ | $26\pm 12$ | $11\pm 9$ | $9\pm 7$ | $\mathrm{\U0001d7d0\U0001d7d0}\mathbf{\pm}\mathrm{\U0001d7cf\U0001d7d4}$ |
Q*Bert | $3896\pm 710$ | $\mathrm{\U0001d7cf\U0001d7cf\U0001d7d4\U0001d7cf\U0001d7ce}\mathbf{\pm}\mathrm{\U0001d7d6\U0001d7d7\U0001d7d6}$ | $3743\pm 1100$ | $3951\pm 1321$ | $\mathrm{\U0001d7d3\U0001d7d4\U0001d7d3\U0001d7d2}\mathbf{\pm}\mathrm{\U0001d7d2\U0001d7d6\U0001d7d1}$ | $1480\pm 271$ |
Ms. Pac-Man | $4178\pm 510$ | $\mathrm{\U0001d7d4\U0001d7d7\U0001d7d4\U0001d7d6}\mathbf{\pm}\mathrm{\U0001d7d5\U0001d7d6\U0001d7d5}$ | $2101\pm 56$ | $3900\pm 852$ | $\mathrm{\U0001d7d4\U0001d7d7\U0001d7d7\U0001d7d5}\mathbf{\pm}\mathrm{\U0001d7d0\U0001d7d3\U0001d7d4}$ | $1851\pm 98$ |
Space Invaders | $672\pm 13$ | $\mathrm{\U0001d7cf\U0001d7ce\U0001d7d0\U0001d7d7}\mathbf{\pm}\mathrm{\U0001d7cf\U0001d7d3\U0001d7d5}$ | $737\pm 29$ | $598\pm 110$ | $\mathrm{\U0001d7d7\U0001d7cf\U0001d7d4}\mathbf{\pm}\mathrm{\U0001d7d0\U0001d7d0\U0001d7d6}$ | $756\pm 30$ |
Pong | $\mathrm{\U0001d7cf\U0001d7d5}\mathbf{\pm}\mathrm{\U0001d7d0}$ | $7\pm 4$ | $6\pm 4$ | $-7\pm .9$ | $-11\pm .5$ | $\mathbf{-}\mathrm{\U0001d7d3}\mathbf{\pm}\mathrm{\U0001d7d4}$ |
Atari: Figure 3 and Table 1 show the performance of our method compared to the D3QN baseline and to the $\u03f5$-greedy policy with EC for five different Atari games. Overall, MEMEC outperformed the other methods in these games most of the time, not only in terms of the maximum achieved reward, but also in terms of the learning speed (see Section 10 for additional learning curves). For Space Invaders, Q*Bert and Ms. Pac-Man, our MEMEC agents, for both EC methods, outperformed the other algorithms, showing more rapid learning and higher final scores. For Bowling, MFEC-based MEMEC performed similarly to $\u03f5$-greedy, and somewhat worse in Pong. NEC-based agents—either with $\u03f5$-greedy or mellowmax policies—failed to solve Bowling and Pong. We note that due to computational limitations we were unable to use the very large buffer sizes from the original works, which can also explain the inability to reproduce the performance of $\u03f5$-greedy EC; it remains to be seen if using larger buffer sizes would similarly improve the performance of MEMEC.
6 Discussion
In this work we have investigated the use of more principled exploration methods in combination with EC, and proposed MEMEC, a method that addresses one of the main limitations of state-of-the-art RL models, namely, sample efficiency. We show over a range of domains, including Atari games, that MEMEC can reliably achieve higher rewards faster, with stable performance across seeds.
One limitation of mellowmax-based policies is its sensitivity to the value of $\omega $ across different domains Asadi and Littman [2017], Kim et al. [2019], and the subsequent searches used to find optimal values, which are prohibitive in domains such as Atari. Implementing methods to automatically determine and tune $\omega $ will be an important area for future work.
References
- Arulkumaran et al. [2017] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.
- Asadi and Littman [2017] Kavosh Asadi and Michael L Littman. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pages 243–252, 2017.
- Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Blundell et al. [2016] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.
- Brent [2013] Richard P Brent. Algorithms for minimization without derivatives. Courier Corporation, 2013.
- Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Chen et al. [2017] Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
- Daw et al. [2005] Nathaniel D Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12):1704, 2005.
- Fox et al. [2016] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence, 2016.
- Greenberg and Verfaellie [2010] Daniel L Greenberg and Mieke Verfaellie. Interdependence of episodic and semantic memory: evidence from neuropsychology. Journal of the International Neuropsychological Society, 16(5):748–753, 2010.
- Hasselt [2010] Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
- Kim et al. [2019] Seungchan Kim, Kavosh Asadi, Michael Littman, and George Konidaris. DeepMellow: Removing the Need for a Target Network in Deep Q-Learning. In International Joint Conference on Artificial Intelligence, 2019.
- Lai and Robbins [1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- Lake et al. [2017] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
- Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2018.
- Lengyel and Dayan [2008] Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In Advances in Neural Information Processing Systems, pages 889–896, 2008.
- Lin [1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
- Littman and Szepesvári [1996] Michael L Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning, volume 96, pages 310–318, 1996.
- Machado et al. [2017] Marios C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pages 2295–2304, 2017.
- Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Nachum et al. [2017] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775–2785, 2017.
- Neu et al. [2017] Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
- Osband [2016] Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. In NIPS Workshop on Bayesian Deep Learning, 2016.
- Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
- Osband et al. [2016] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
- Pritzel et al. [2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In International Conference on Machine Learning, pages 2827–2836, 2017.
- Richemond and Maginnis [2017] Pierre H Richemond and Brendan Maginnis. A short variational proof of equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1712.08650, 2017.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.
- Wang et al. [2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003, 2016.
- Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
- Yu [2018] Yang Yu. Towards sample efficient reinforcement learning. In International Joint Conference on Artificial Intelligence, pages 5739–5743, 2018.
7 Additional Background
Deep Q-network: The DQN uses $Q$-learning, updating a neural-network based state-action value function, $Q(s,a;\theta )$, parameterised by $\theta $.
The deep Q-network receives the state $s$ as input and outputs a vector with the state-action values $Q(s,a,\theta )\mathit{\hspace{1em}}\forall a\in \mathcal{A}$. In order to improve the stability of RL with function approximators, the original authors proposed using experience replay Lin [1992] and a target network Mnih et al. [2015]. The target network, parameterised by ${\theta}^{-}$, is equal to an older version of the online network and it is updated every $\tau $ steps. The TD target for the DQN is defined by:
$${y}_{t}^{DQN}={r}_{t+1}+\gamma \underset{a}{\mathrm{max}}Q({s}_{t+1},a;{\theta}_{t}^{-}).$$ | (1) |
The experience replay is a cyclic buffer that stores the $({s}_{t},{a}_{t},{r}_{t+1},{s}_{t+1})$ transitions that the agent observes. Samples are retrieved from it uniformly at regular intervals to train the network.
There have been several proposed extensions that improve this algorithm. Firstly, the $\mathrm{max}$ operator present in $Q$-learning is responsible for both the selection and evaluation of state-action values, and hence can result in overoptimistic estimated $Q$-values. In double $Q$-learning Hasselt [2010], an alternative update rule for $Q$-learning, the selection of actions and the evaluation of their values are performed by separate estimators. To do so, two value functions are implemented, parameterised by $\theta $ and ${\theta}^{\prime}$: One is used for the greedy selection of actions whereas the other is used to determine the corresponding value. For the double DQN algorithm, the online and target networks are used as the two estimators Van Hasselt et al. [2016]. In this case, the TD target can be written as:
$${y}_{t}^{DDQN}={r}_{t+1}+\gamma Q({s}_{t+1},\underset{a}{\mathrm{arg}\mathrm{max}}Q({s}_{t+1},a;{\theta}_{t});{\theta}_{t}^{-}).$$ | (2) |
Another extension based on the DQN is the dueling double DQN (D3QN) algorithm Wang et al. [2016]. The dueling architecture extends the standard DQN by calculating separate advantage ${A}_{\theta}$ and value ${V}_{\theta}$ streams, sharing the same convolutional neural network as a base. This imposes the form that state-action values can be calculated as offsets (“advantages”) from the state value. The $Q$-function is then computed as:
$${Q}_{\theta}(s,a)={V}_{\theta}(s)+({A}_{\theta}(s,a)-\frac{1}{|\mathcal{A}|}\sum _{{a}^{\prime}}{A}_{\theta}(s,{a}^{\prime})).$$ | (3) |
Model-free episodic control: In MFEC, the episodic controller ${Q}^{EC}(s,a)$ is represented by a fixed-size table for each action. Each entry corresponds to the state-action pair and the highest return ever obtained from taking action $a$ from state $s$. Whenever a new state is encountered, the return is estimated based on the average of the observed returns from the $k$-nearest states ${s}^{(i)}$:
$$\widehat{{Q}^{\mathrm{EC}}}(s,a)=\{\begin{array}{cc}\frac{1}{k}\sum _{i=1}^{k}{Q}^{\mathrm{EC}}({s}^{(i)},a)\hfill & \text{if}(s,a)\notin {Q}^{\mathrm{EC}},\hfill \\ {Q}^{\mathrm{EC}}(s,a)\hfill & \text{otherwise}.\hfill \end{array}$$ | (4) |
The state-action value function ${Q}^{EC}(s,a)$ is updated at the end of each episode, replacing the least-recently-used transitions with new state-action pairs and their corresponding episodic returns. If the pair $(s,a)$ already exists in the table, the stored value will be updated to the maximum of the stored return and the newly observed return.
For high-dimensional state spaces, the original authors used Gaussian random projections in order to reduce the dimensionality of the states before using them in the episodic controller.
Neural Episodic Control: NEC consists of two components: a neural network for learning a feature mapping, and a differentiable neural dictionary (DND) per action (see Figure 1). The neural network receives as input the state $s$ and outputs a key/embedding $h$. Each DND memory performs a lookup based on the current key $h$, by comparing it with the keys ${h}_{i}$ already stored in the DND, and retrieving the $k$ most similar keys with their corresponding ${Q}_{i}$ values. The predicted $Q(s,a)$ value, based on $h$, is the weighted sum of the retrieved ${Q}_{i}$ values: ${\sum}_{i}{w}_{i}{Q}_{i}$. Each weight ${w}_{i}$ is calculated by using the inverse distance weighted kernel:
$$k(h,{h}_{i})=\frac{1}{{\parallel h-{h}_{i}\parallel}_{2}^{2}+\delta}.$$ | (5) |
Updating the DND involves appending new key-value pairs to the dictionary or, for cases where the same state already exists in that dictionary, updating the value of $Q(s,a)$ as in $Q$-learning:
$${Q}_{i}\leftarrow {Q}_{i}+\alpha \left({R}^{(n)}(s,a)-{Q}_{i}\right),$$ | (6) |
where $\alpha $ is the learning rate, and ${R}^{(n)}$ is the $n$-step return Sutton and Barto [2018] bootstrapped from the predicted $Q$-values from NEC during each episode.
The parameters of the neural network are trained by minimising the mean squared error loss between the predicted $Q$-values and the stored returns, using a random sample of $({s}_{t},{a}_{t},{R}_{t}^{(n)})$ tuples stored in experience replay Lin [1992], in a similar fashion to the DQN algorithm.
8 Cartpole Results
9 GridWorld Results
10 Atari Results
11 Hyperparameters
Parameters name | D3QN | MFEC | NEC |
---|---|---|---|
$\u03f5$ initial | $1$ | $1$ | $1$ |
$\u03f5$ final | $5\times {10}^{-3}$ | $5\times {10}^{-3}$ | $5\times {10}^{-3}$ |
$\u03f5$ anneal start (steps) | $1$ | $5\times {10}^{3}$ | $5\times {10}^{3}$ |
$\u03f5$ anneal end (steps) | $5\times {10}^{4}$ | $25\times {10}^{3}$ | $25\times {10}^{3}$ |
Discount factor $\lambda $ | $0.99$ | $0.99$ | $0.99$ |
Reward clip | Yes | No | No |
Number of neighbours $k$ | — | $11$ | $11$ |
Kernel delta $\delta $ | — | ${10}^{-3}$ | ${10}^{-3}$ |
Experience replay size | ${10}^{5}$ | — | ${10}^{5}$ |
RMSprop learning rate | $25\times {10}^{-5}$ | — | $7.92\times {10}^{-6}$ |
RMSprop momentum | $0.95$ | — | $0.95$ |
RMSprop $\u03f5$ | ${10}^{-2}$ | — | ${10}^{-2}$ |
Training start (steps) | $5\times {10}^{3}$ | — | ${10}^{3}$ |
Batch size | $32$ | — | $32$ |
Target network update (steps) | $7.5\times {10}^{3}$ | — | — |
Observation projection | — | None | — |
Projection key size | — | None | — |
Memory learning rate $\alpha $ | — | — | $0.1$ |
$n$-step return | — | — | $100$ |
Key size | — | — | $64$ |
Parameters name | D3QN | MFEC | NEC |
---|---|---|---|
$\u03f5$ initial | $1$ | $1$ | $1$ |
$\u03f5$ final | ${10}^{-2}$ | $5\times {10}^{-3}$ | ${10}^{-3}$ |
$\u03f5$ anneal start (steps) | $1$ | $5\times {10}^{3}$ | $5\times {10}^{3}$ |
$\u03f5$ anneal end (steps) | ${10}^{6}$ | $25\times {10}^{3}$ | $25\times {10}^{3}$ |
Discount factor $\lambda $ | $0.99$ | $0.99$ | $0.99$ |
Reward clip | Yes | No | No |
Number of neighbours $k$ | — | $11$ | $11$ |
Kernel delta $\delta $ | — | ${10}^{-3}$ | ${10}^{-3}$ |
Experience replay size | ${10}^{6}$ | — | ${10}^{5}$ |
RMSprop learning rate | $25\times {10}^{-5}$ | — | $7.92\times {10}^{-6}$ |
RMSprop momentum | $0.95$ | — | $0.95$ |
RMSprop $\u03f5$ | ${10}^{-2}$ | — | ${10}^{-2}$ |
Training start (steps) | $12.5\times {10}^{3}$ | — | $5\times {10}^{4}$ |
Batch size | $32$ | — | $32$ |
Target network update (steps) | ${10}^{3}$ | — | — |
Observation projection | — | Gaussian | — |
Projection key size | — | 128 | — |
Memory learning rate $\alpha $ | — | — | $0.1$ |
$n$-step return | — | — | $100$ |
Key size | — | — | $128$ |