Abstract
Successfully navigating a complex environment to obtain a desired outcome isa difficult task, that up to recently was believed to be capable only byhumans. This perception has been broken down over time, especially with theintroduction of deep reinforcement learning, which has greatly increased thedifficulty of tasks that can be automated. However, for traditionalreinforcement learning agents this requires an environment to be able toprovide frequent extrinsic rewards, which are not known or accessible for manyrealworld environments. This project aims to explore and contrast existingreinforcement learning solutions that circumnavigate the difficulties of anenvironment that provide sparse rewards. Different reinforcement solutions willbe implemented over a several video game environments with varying difficultyand varying frequency of rewards, as to properly investigate the applicabilityof these solutions. This project introduces a novel reinforcement learningsolution by combining aspects of two existing state of the art sparse rewardsolutions, curiosity driven exploration and unsupervised auxiliary tasks.
Quick Read (beta)
University of Sheffield
[1.5cm] Dealing with Sparse Rewards in Reinforcement Learning
[1.5cm]
[1cm] Joshua Hare
[1cm] Supervisor: Eleni Vasilaki
[1cm] A report submitted in fulfilment of the requirements
for the degree of MSc in Advanced Computer Science
[0.3cm] in the
[0.3cm] Department of Computer Science
[2cm] September 11, 2019
Declaration
\setstretch1.1
All sentences or passages quoted in this report from other people’s work have been specifically acknowledged by clear crossreferencing to author, work and page(s). Any illustrations that are not the work of the author of this report have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this project and the degree examination as a whole.
Name: Joshua Hare
Signature: Joshua Hare
Date: 15/05/2019
Abstract
Successfully navigating a complex environment to obtain a desired outcome is a difficult task, that up to recently was believed to be capable only by humans. This perception has been broken down over time, especially with the introduction of deep reinforcement learning, which has greatly increased the difficulty of tasks that can be automated. However, for traditional reinforcement learning agents this requires an environment to be able to provide frequent extrinsic rewards, which are not known or accessible for many realworld environments. This project aims to explore and contrast existing reinforcement learning solutions that circumnavigate the difficulties of an environment that provide sparse rewards. Different reinforcement solutions will be implemented over a several video game environments with varying difficulty and varying frequency of rewards, as to properly investigate the applicability of these solutions. This project introduces a novel reinforcement learning solution by combining aspects of two existing state of the art sparse reward solutions, curiosity driven exploration and unsupervised auxiliary tasks.
Contents
 1 Introduction
 2 Reinforcement Learning Background
 3 Neural Networks
 4 Deep Reinforcement Learning
 5 Evaluation and Testing
 6 Implementation and Analysis
 7 Results and Discussion
 8 Conclusions and Further Work
List of Figures
 2.1 Greedy policy iteration for stateaction value function
 3.1 Visual representation of stochastic gradient optimisers
 4.1 Deep Q Learning algorithm
 4.2 Advantage ActorCritic (A2C) algorithm
 4.3 Proximal Policy Optimisation (PPO) algorithm
 4.4 Overview of the UNREAL agent
 7.1 Atari baselines for A2C vs SyncDDQN
 7.2 Atari experiments for PPOGAE vs A2CGAE and SyncDDQN($\lambda $)
 7.3 Atari experiments for RND vs UNREALA2C2 vs RANDAL agents
 7.4 Classic Control A2C Adam hyperparameter search
 7.5 Classic Control SyncDDQN Adam hyperparameter search
 7.6 Classic Control baselines Adam vs RMSProp for A2C and SyncDQQN agents
 7.7 Classic Control A2CGAE vs SyncDDQN($\lambda $) vs PPO
 7.8 Classic Control experiments for RND vs UNREALA2C2 vs RANDAL agents
[title=Summary of Notation]
1.1
Chapter 1 Introduction
Reinforcement learning is a multidisciplinary field combining aspects from psychology, neuroscience, mathematics and computer science, where an agent learns to interact with a environment by taking actions and receiving rewards. This is inspired from observations of chemical reward signals found in brains of humans and many other animal [1], that dictate the behaviours of these animals. Video games have been a common benchmark for reinforcement learning (RL) throughout the existence of RL. From Samuel’s checkers player [2] in 1959 to the game of Go [3], Atari2600 games [4], to most recently Deepmind’s Starcraft II [5], which has over ${10}^{8}$ possible available actions that the agent can take. The preference for Video games in RL, is easily understandable, as they provide an enclosed simulatory environment, often with easily accessible, quantifiable rewards that may not be available or difficult to measure in a real world environment. As it is common for many RL algorithms to take millions of training examples to effectively learn to interact with its environment, video games are useful in that they are computational simulations. This means that the environment is able to run faster than realtime, and many simulations can be performed simultaneously.
However, many modern approaches to RL such as Deep Q learning [6] require, as stated previously, millions of examples in order to optimise due to some fundamental difficulties of the task. One of these is that often RL agents have no prior knowledge of the environment in this case video games, as often video games are modelled on and include real world objects such as ladders, and physics such as gravity. From [7] it was shown that humans perform significantly worse when prior knowledge is masked, and similarly, reinforcement agents perform better when given prior information about the environment.
Another fundamental difficulty of RL is that of a sparse reward signal provided by an environment. A Sparse reward signal is series of rewards produced by the agent interacting with the environment, where most of the rewards received are nonpositive. This makes it extremely difficult for reinforcement learning algorithms to connect a long series of actions to a distant future reward. For extremely sparse rewards, the agent may never find a reward signal, and thus will never learn how to perform a given task. Human agents are not hindered by sparse reward signals and in fact are capable of achieving tasks over a entire lifetime with little to no reward through intrinsic motivation [8], traditional RL algorithm’s learning is severely inhibited by sparse rewards.
This means that in order to automate a specific tasks with traditional reinforcement learning, the rewards received by the agent have to be carefully designed in order to maximise optimal behaviour. This is a problem as it requires that researchers and developers take time to fully understand the dynamics or optimal behaviour for an environment a priori. This takes time to implement, or worse the dynamics or optimal behaviour may unknown to the designers, so the rewards given to the agent may provide suboptimal or nonfunctioning behaviour. By having an agent that can learn from sparse rewards, it allows designers to provide more abstract and long term goals, as well as being able tackle more complex tasks.
This project seeks to explore and contrast solutions to the inhibiting effect sparse rewards on RL agents, their implementations can be found on the following GitHub repository
https://github.com/jhare96/reinforcementlearning/.
1.1 Aims and Objectives
In this project the overall aims are:

•
Use reinforcement learning to successfully implement a working agent capable of playing video games

•
Explore and contrast existing reinforcement methods to deal with sparse reward signals

•
Implement existing solutions for sparse reward reinforcement learning on a series of increasing difficulty tasks
1.2 Overview of the Report
This report begins with a literature survey starting with chapter 2, which is a detailed introduction to reinforcement learning, specifically with the use of Markov Decision Processes. This chapter introduces some key terminology for RL and also introduces the mathematical framework used to evaluate and create RL algorithms as well the algorithms themselves. The literature survey continues on to introduce neural networks in chapter 3, finally a final chapter 4 on deep neural networks use in RL, including state of the art neural network implementations related to solving RL with sparse rewards.
Chapter 5 contains a brief description of the environments used to evaluate the agents implemented in this project. Following this is chapter 6 dedicated to a detailed explanation on the implementations of the RL agents tested in this project, as well as preliminary analysis on their expected performance. The results of the experiments performed are presented in chapter 7 and finally the conclusions of the project are found in chapter 8
Chapter 2 Reinforcement Learning Background
2.1 Markov Chains
A Markov chain is a probabilistic model that connects are series of states $s$ in the set $\mathcal{S}$ via transition probabilities, i.e. moving from one state to another [9]. A state is some representation of a model, this could be a known behaviour or variable, or an unobserved or partially observed hidden representation of a system (Hidden Markov Model). The transition probabilities between the states in the Markov Chain hold the Markov property, that is the transition between the next state of a system ${s}_{t+1}$ is only dependent on the current state ${s}_{t}$ irrespective of the previous state ${s}_{t1}$ [10].
$$p({s}_{t+1}{s}_{0},{s}_{1},\mathrm{\dots},{s}_{t})=p({s}_{t+1}{s}_{t})$$  (2.1) 
2.2 Discrete Markov Decision Processes
A Discrete Markov Decision Process (MDP) is a decision making process characterised by a discrete state and time Markov chain. In a MDP the set of states $\mathcal{S}$ are a representation of the environment e.g. game piece configuration on a board game [11]. In an MDP at each time step a decision is made by taking an action ${a}_{t}$ from a set of actions available in state $s$, ${a}_{t}\in {\mathcal{A}}_{s}$. Taking the action ${a}_{t}$ produces a state transition modelled by the probability distribution ${\mathcal{P}}_{s{s}^{\prime}}^{a}=P({s}_{t+1}={s}^{\prime}{s}_{t}=s,{a}_{t}=a)$ [12], following each state transition a real reward is received ${r}_{t}$ ^{1}^{1} 1 The reward after transitioning from state ${s}_{t}$ to ${s}_{t+1}$ is sometimes denoted as ${r}_{t+1}$ [12]. The reasoning for the choice of ${r}_{t}$ is to be consistent with the modern deep reinforcement learning papers [6],[13],[14],[15], which use this notation. $\in \mathcal{R}$ characterised by the reward model ${r}_{t}={\mathcal{R}}_{s}^{a}=({r}_{t}{s}_{t}=s,{a}_{t}=a)$.
2.3 Reinforcement Learning in Markov Decision Processes
Reinforcement learning is an optimisation process in which an agent learns a policy $\pi $ (a strategy in which determines what decisions will be made), by interacting with an environment such that it maximises the expected future reward. For reinforcement learning in an MDP the optimisation process is to learn the optimal policy $\pi \approx {\pi}^{*}$ which maximises the cumulative expected future reward [12], known as the return^{2}^{2} 2 Also referred to as ${G}_{t}$ in [12] $R$. The return at time $t$, ${R}_{t}$ is the expected future reward from state ${s}_{t}$ onwards indefinitely for continuous tasks or until the final state ${s}_{T}$ for episodic tasks.
$${R}_{t}=\sum _{k=0}^{\mathrm{\infty}}{r}_{t+k}$$  (2.2) 
A policy $\pi $ maps the state $s$ onto a probability distribution over the available actions in ${\mathcal{A}}_{s}$ [12]. The probability for each action $a\in {\mathcal{A}}_{s}$ is denoted by the term $\pi (as)$. Value functions provide an estimation of the value of being in state $s$ or the value of taking an action $a$ in $s$ under a certain policy $\pi $. Commonly used in reinforcement learning are the value functions ${V}^{\pi}$ and ${Q}^{\pi}$, which denote the value and stateaction value functions respectively.
The value function for a given state $s$ under $\pi $ is given by the expected future reward of from starting at $s$ and following the policy $\pi $ until the terminal state. Often a discount factor $\gamma \in [0,1]$ is used dampen the reward signal from distant future actions so that the agent prioritises more immediate rewards (with the idea that immediate rewards are better than distant ones) and to deal with any uncertainty in the model [16]. Similarly the stateaction value function is given by the expected future reward starting from $s$ taking action $a$ then following $\pi $ onwards. The value function and stateaction value function are given by:
${V}^{\pi}(s)$  $={\mathbb{E}}_{\pi}[{R}_{t}{s}_{t}=s]={\mathbb{E}}_{\pi}\left[{\displaystyle \sum _{k=0}^{\mathrm{\infty}}}{\gamma}^{k}{r}_{t+k}\right{s}_{t}=s]$  (2.3)  
${Q}^{\pi}(s,a)$  $={\mathbb{E}}_{\pi}[{R}_{t}{s}_{t}=s,{a}_{t}=a]={\mathbb{E}}_{\pi}\left[{\displaystyle \sum _{k=0}^{\mathrm{\infty}}}{\gamma}^{k}{r}_{t+k}\right{s}_{t}=s,{a}_{t}=a]$  (2.4) 
In order to solve reinforcement learning, the optimal policy ${\pi}^{*}$ is to be found, which is to say to find the strategy which yields the highest reward over all the states. A policy $\pi $ is considered better than another policy ${\pi}^{\prime}$ if it’s value function is greater over all states [12], ${V}^{\pi}(s)>{V}^{{\pi}^{\prime}}(s)\forall s\in S$. Therefore, this can be extended to define the optimal policy:
${\pi}^{*}=\mathrm{arg}\underset{\pi}{\mathrm{max}}{V}^{\pi}(s)\forall s\in S$  (2.5) 
Since ${V}^{*}$ is the maximum expected reward from starting from state $s$ then it must be the maximum of the expected reward over all possible available actions [12] ${V}^{*}(s)={\mathrm{max}}_{a\in {\mathcal{A}}_{s}}{Q}^{*}(s,a)$. Hence, the optimal policy can be written in terms of the optimal stateaction value function:
${\pi}^{*}=\mathrm{arg}\underset{a}{\mathrm{max}}{Q}^{*}(s,a)\forall s\in S$  (2.6) 
2.4 Dynamic Programming
Dynamic programming is technique developed by Bellman to recursively solve a structural search problem (in this case a the optimal policy for a MDP), by storing the values of computed variables that reoccur during the recursive process [17]. From equations (2.3) & (2.4) it can be seen that the cumulative expected reward ${R}_{t}={\sum}_{k=0}^{\mathrm{\infty}}{r}_{t+k}$ is calculated at each time step $t$, which can be computationally demanding for large MDPs.
Using ${\gamma}^{0}=1$ (2.3) can be expanded [18]:
${V}^{\pi}(s)$  $={\mathbb{E}}_{\pi}[{r}_{t}+\gamma {r}_{t+1}+{\gamma}^{2}{r}_{t+2}+\mathrm{\dots}{\gamma}^{K}{r}_{t+K}{s}_{t}=s]\text{Using Definition of}{R}_{t}\text{}\text{2.2}$  
$={\mathbb{E}}_{\pi}[{r}_{t}+\gamma {R}_{t+1}{s}_{t}=s]$  (2.7)  
conditioning on the next state ${s}^{\prime}$ using law of total expectation  
$={\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}P({s}_{t+1}={s}^{\prime}{s}_{t}=s){\mathbb{E}}_{\pi}[{r}_{t}+\gamma {R}_{t+1}{s}_{t}=s,{s}_{t+1}={s}^{\prime}]$  
then condition on the action $a$  
$\sum _{a}}P(as){\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}P({s}^{\prime}s,a){\mathbb{E}}_{\pi}[{r}_{t}+\gamma {R}_{t+1}s,{s}^{\prime},a]$  
then using addition theorem of expectations  
$\sum _{a}}P(as){\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}P({s}^{\prime}s,a)[{\mathbb{E}}_{\pi}\left[{r}_{t}\rights,{s}^{\prime},a]+{\mathbb{E}}_{\pi}\left[{R}_{t+1}\rights,{s}^{\prime},a]]$  
using the Markov property and definition of Reward model  
$\sum _{a}}P(as){\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}P({s}^{\prime}s,a)[{\mathcal{R}}_{s}^{a}+{\mathbb{E}}_{\pi}\left[{R}_{t+1}\right{s}^{\prime}]]$ 
Thus the value function can be expressed recursively with the reward and transition model:
${V}^{\pi}(s)={\displaystyle \sum _{a}}\pi (as){\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathcal{P}}_{s{s}^{\prime}}^{a}\left[{\mathcal{R}}_{s}^{a}+\gamma {V}^{\pi}({s}^{\prime})\right]$  (2.8) 
A similar derivation can be done conditioning on the next state ${s}^{\prime}$ and next action ${a}^{\prime}$ or from (2.8) using:
${V}^{\pi}(s)={\displaystyle \sum _{a\in A}}\pi (as){Q}^{\pi}(s,a)\u27f9{Q}^{\pi}(s,a)={\displaystyle \sum _{{s}^{\prime}\in \mathcal{S}}}{\mathcal{P}}_{s{s}^{\prime}}^{a}\left[{\mathcal{R}}_{s}^{a}+\gamma {V}^{\pi}({s}^{\prime})\right]$  
${Q}_{\pi}(s,a)={\displaystyle \sum _{{s}^{\prime}\in S}}{\mathcal{P}}_{s{s}^{\prime}}^{a}\left[{\mathcal{R}}_{s}^{a}+\gamma {\displaystyle \sum _{{a}^{\prime}\in \mathcal{A}}}\pi ({a}^{\prime}{s}^{\prime}){Q}_{\pi}({s}^{\prime},{a}^{\prime})\right]$  (2.9) 
These are known as the Bellman equations for stochastic processes.
Dynamic program requires a known MDP in order to estimate the optimal value function hence update the policy. The policy can be updated iteratively by computing the value function of a policy $\pi $ as according to the Bellman equations given above, then updating the policy using (2.5) or (2.6) depending on the value function used.
2.5 Model Free Reinforcement Learning
Rather than learning the optimal policy for a known MDP structure, modelfree reinforcement learning attempts to find the value function of being in state $s$ directly from experience. This is a crucial step for applying reinforcement learning to real world problems, as most environments have an unknown MDP or too large stateaction space to computationally model. The cost of this approach is the agent has a less efficient learning process, i.e. many more samples required to optimally converge.
2.5.1 Monte Carlo
In Monte Carlo reinforcement learning is a model free method which learns the value function for episodic tasks. An episodic task is a sequence of sequential experiences $$ that always have a terminal state i.e. from ${s}_{0},\mathrm{\dots},{s}_{T}$. Most games can be defined as episodic tasks, an example being a game of chess always has a terminal state (a final boardpiece configuration) after the game has finished through a win or draw. The reason Monte Carlo reinforcement learning can only learn from episodic tasks is that it approximates the expectation of future rewards in the value functions (2.8) and (2.9) with an average of ${R}_{t}$ for all states visited, over each episode. The future reward ${R}_{t}$ can be determined online with dynamic programming however, the average state count can only be determined at the end of the episode. The average will tend to the true expectation as the number of samples approaches infinity due to the central limit theorem. Two common ways the average is calculated in Monte Carlo methods are the first visit evaluation and the every visit evaluation. In the first visit approach, the value function for a state $s$ is calculated only once per sample (episode) and is averaged over all samples. The every visit approach averages the value of the state for every visit including multiple visits per episode [12]. The Value function can be expressed via $V(s)=S(s)/N(s)$, where $S(s)$ is the sum of sample value estimates from state $s$ and $N(s)$ is the number of times the agent has visited state $s$.
The value function can be updated iteratively in Monte Carlo learning, as the mean of a series can be computed iteratively. For a dynamic problem it may be useful to calculate the running mean whilst forgetting older episodes, this is by introducing a dynamic variable $\alpha $ [12]:
$$V\left({s}_{t}\right)\leftarrow V\left({S}_{t}\right)+\alpha \left({R}_{t}V\left({S}_{t}\right)\right)$$  (2.10) 
A disadvantage Monte Carlo method introduces is the inability to learn continuous tasks and the inability to update the value function before the end of an episode. This is not an issue for short time length episodes, but can greatly increase the time taken to find the optimal policy.
2.5.2 Temporal Difference
Temporal Difference combines the model free learning from the Monte Carlo method with the bootstrapping (expected future reward value estimation) from Dynamic Programming. This allows policy learning during episodes or continuous tasks, without the need to model the transition probabilities ${\mathcal{P}}_{s{s}^{\prime}}^{a}$. In Temporal Difference learning, the optimal value function is found by iteratively updating the value function with the temporal error between the value predicted at $V({s}_{t})$ and the future reward ${R}_{t}={\sum}_{k}{\gamma}^{k}{r}_{t+k}$ similar to Monte Carlo evaluation (2.10). This is done by sampling from the expected future reward from (2.7) and using current value function estimate instead of the true value function, as seen in Dynamic Programming [12]. The temporal error is then:
$${\delta}_{t}={r}_{t}+\gamma V({s}_{t+1})V{s}_{t})$$  (2.11) 
$V(s)$ can be iteratively updated using the temporal error ${\delta}_{t}$, this is known as $TD(0)$:
$$V(s)\leftarrow V(s)+\alpha \left[{r}_{t}+\gamma V({s}_{t+1})V({s}_{t})\right]$$  (2.12) 
Using the stateaction value function in (2.12) results in the onpolicy method commonly known as SARSA (state, action, reward, state, action)
$$Q(s,a)\leftarrow Q(s,a)+\alpha \left[{r}_{t}+\gamma Q({s}_{t+1},{a}_{t+1})Q({s}_{t},{a}_{t})\right]$$  (2.13) 
A bridge between Temporal Difference and Monte Carlo methods can be extended first with an $n$step Temporal Difference error TD($n$) which is defined as the temporal error between state $V({s}_{t})$ and $V({s}_{t+n})$ defined using the $n$step future return ${R}_{t}^{(n)}$ [12]:
${R}_{t}^{(n)}={r}_{t}+\gamma {r}_{t+1}+\mathrm{\dots}+{\gamma}^{n1}{r}_{t+n1}+{\gamma}^{n}V\left({s}_{t+n}\right)$  (2.14)  
${\delta}_{t}^{(n)}={R}_{t}^{(n)}V({s}_{t})$  (2.15) 
The optimal number of steps $n$ depends on the length of the task and the number of states, finding a general solution to this problem is done by further extending the bridge to Monte Carlo is the $TD(\lambda )$ approach. This is where all the expected rewards (2.14) are computed from $n=0,\mathrm{\dots}T$ and combined geometrically via [19]:
$${R}_{t}^{\lambda}=(1\lambda )\sum _{n=1}^{\mathrm{\infty}}{\lambda}^{n1}{R}_{t}^{(n)}$$  (2.16) 
The $\lambda $ approach can also be applied to SARSA, SARSA($\lambda $).
Q Learning(an offpolicy version of SARSA) was developed by Watkins [20]. Here instead of approximating the true stateaction value function for a given policy ${Q}^{\pi}(s,a)$ with the current estimate, the optimal stateaction value function is approximated ${Q}^{*}(s,a)$.
$$Q(s,a)\leftarrow Q(s,a)+\alpha \left[{r}_{t}+\gamma \underset{a}{\mathrm{max}}Q({s}_{t+1},a)Q({s}_{t},{a}_{t})\right]$$  (2.17) 
Q Learning suffers from a maximisation bias, that is that it overestimates the values in the stateaction function $Q(s,a)$ i.e. with a positive bias. A solution was proposed in [21] to use two stateaction value functions, one to predict the next action to be taken and another to predict the value estimation of the action. (2.17) then becomes:
${Q}_{A}(s,a)\leftarrow {Q}_{A}(s,a)+\alpha \left[{r}_{t}+\gamma {Q}_{B}({s}^{\prime},\mathrm{arg}\underset{a}{\mathrm{max}}{Q}_{A}({s}^{\prime},a)){Q}_{A}(s,a)\right]$  (2.18)  
${Q}_{B}(s,a)\leftarrow {Q}_{B}(s,a)+\alpha \left[{r}_{t}+\gamma {Q}_{A}({s}^{\prime},\mathrm{arg}\underset{a}{\mathrm{max}}{Q}_{B}({s}^{\prime},a)){Q}_{B}(s,a)\right]$ 
Both ${Q}_{A}(s,a)$ and ${Q}_{B}(s,a)$ are used as the to select the action at time $t$ and at each optimisation step either A or B are updated via random selection.
Q Learning also suffers from poor exploration as the approximation to optimal value is being used as the policy hence often limiting the exploration to states of known rewards. A solution to this is to use a decaying $\epsilon $  greedy policy to enable initial exploration, then use optimal policy once the states have been sufficiently explored.
$$\epsilon greedy\{\begin{array}{cc}{a}_{t}=\text{random}a\in \mathcal{A}\hfill & \text{with probability}\epsilon \hfill \\ {a}_{t}=\underset{a}{\mathrm{max}}{Q}^{*}(s,a)\hfill & \text{with probability}\mathrm{\hspace{0.17em}\hspace{0.17em}1}\epsilon \hfill \end{array}$$  (2.19) 
2.5.3 Policy Gradient
Rather than learning the optimal policy through the maximisation over value function approximations 2.5 or 2.6 (which can be expensive for high dimensional stateactions spaces), the policy can be learnt directly under the condition that the policy is parametrised by a differentiable function. The policy is learnt by calculating the gradient of the current policy estimate $\nabla \pi (as,\theta )$ [12]. This allows us to learn, stochastic policies, continuous action policies and should theoretically provide more stable convergence properties [16] over valuebased methods such as QLearning. However learning the policy directly can reduce the sample efficiency of the agent taking longer to converge to the optimal policy.
The policy gradient theorem shows that maximising the gradient of the value function is equivalent to maximising the gradient of the policy multiplied by the stateaction value function all under the expectation of the policy [12]. This gives the gradient objective function $J(\theta )$ (with the objective of maximising the expected future reward) as
$${\nabla}_{\theta}J(\theta )={\mathbb{E}}_{{\pi}_{\theta}}\left[{Q}^{\pi}(s,a){\nabla}_{\theta}\mathrm{log}\pi (as,\theta )\right]$$  (2.20) 
Using the definition of ${Q}^{\pi}(s,a)$ 2.4
$${\nabla}_{\theta}J(\theta )={\mathbb{E}}_{{\pi}_{\theta}}\left[{R}_{t}{\nabla}_{\theta}\mathrm{log}\pi (as,\theta )\right]$$  (2.21) 
As ${Q}^{\pi}(s,a)$ is not dependent on $\theta $ a baseline metric $b({s}_{t})$ can be subtracted from equations 2.20 & 2.21 as long as it is constant w.r.t. $a$.
$${\nabla}_{\theta}J(\theta )={\mathbb{E}}_{{\pi}_{\theta}}\left[({R}_{t}b(s)){\nabla}_{\theta}\mathrm{log}\pi (as,\theta )\right]$$  (2.22) 
Actor Critic
The ActorCritic family of policy gradient methods involves combining the baseline policy gradient method with the bootstrapping method from temporal difference. For the Advantage ActorCritic the baseline function used is the Advantage function
$$A({s}_{t},{a}_{t})=Q({s}_{t},{a}_{t})V({s}_{t})={r}_{t}+\gamma V({s}_{t+1})V({s}_{t}).$$  (2.23) 
Expressing the Advantage function in terms of only ${V}^{\pi}$ allows for the use of a single set of parameters to define the advantage function and reducing the dimensionality of the function being learnt. This has the benefits of being easier to learn as well as decreasing computational complexity, especially for large $\mathcal{A}$ [22].
${\nabla}_{\theta}J(\theta )$  $={\mathbb{E}}_{{\pi}_{\theta}}\left[({r}_{t}+\gamma V({s}_{t+1})V({s}_{t})){\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t},\theta )\right]\mathit{\hspace{1em}\hspace{1em}}\text{where}{a}_{t}\sim \pi $  (2.24)  
$={\mathbb{E}}_{{\pi}_{\theta}}\left[{\delta}_{t}{\nabla}_{\theta}\mathrm{log}\pi ({a}_{t}{s}_{t},\theta )\right]$ 
By using $V(s)$ as a baseline as well as to predict the TD error, value function provides the estimate of the quality of the action taken by the policy. Hence, it is said to be acting as a critic to the policy (the actor), thus giving the name ActorCritic.
Generalised Advantage Estimation
Just as the TD $n$step returns estimator can be combined in TD($\lambda $), the advantage function estimation can be calculated similarly. Following 2.23, the $n$step advantage estimator can be defined as
$${A}_{t}^{n}=\sum _{l=0}^{n1}{\gamma}^{l}{\delta}_{t+l}=\sum _{l=0}^{n1}{\gamma}^{l}{r}_{t+l}+{\gamma}^{n}V({s}_{t+n})V({s}_{t})$$  (2.25) 
Following the derivation from [22], ${A}_{t}^{\lambda}$ can be derived by first exponentially averaging each $n$step advantage estimate  
${A}_{t}^{\lambda}$  $=(1\lambda )\left({A}_{t}^{(1)}+\lambda {A}_{t}^{(2)}+{\lambda}^{2}{A}_{t}^{(3)}+\mathrm{\dots}\right)$  
${A}_{t}^{\lambda}$  $=(1\lambda )\left({\delta}_{t}+\lambda ({\delta}_{t}+\gamma {\delta}_{t+1})+{\lambda}^{2}({\delta}_{t}+\gamma {\delta}_{t+1}+{\gamma}^{2}{\delta}_{t+2})+\mathrm{\dots}\right)$  
Factoring by ${\delta}_{t}$  
${A}_{t}^{\lambda}$  $=(1\lambda )\left({\delta}_{t}(\lambda +{\lambda}^{2}+\mathrm{\dots})+\gamma {\delta}_{t+1}({\lambda}^{2}+{\lambda}^{3}+\mathrm{\dots})+\mathrm{\dots}\right)$  
Using the geometric series solution $\sum _{k=0}^{n1}}a{r}^{k}=a\left({\displaystyle \frac{1{r}^{n}}{1r}}\right)$  
${A}_{t}^{\lambda}$  $=(1\lambda )\left({\delta}_{t}\left({\displaystyle \frac{1}{1\lambda}}\right)+\gamma {\delta}_{t+1}\left({\displaystyle \frac{\lambda}{1\lambda}}\right)+{\gamma}^{2}{\delta}_{t+2}\left({\displaystyle \frac{{\lambda}^{2}}{1\lambda}}\right)+\mathrm{\dots}\right)$ 
then
$${A}_{t}^{\lambda}=\sum _{l=0}^{\mathrm{\infty}}{(\lambda \gamma )}^{l}{\delta}_{t+l}$$  (2.27) 
Chapter 3 Neural Networks
Artificial neural networks are neurologically inspired computational frameworks, which simulate neuron activity as a weighted summation over a series of connected input neurons. There are many different neural networks topologies that are commonly used, but most consist of a series of connected groups of neurons known as layers, which are fed stimulatory input through an input layer. An output layer (most often the final layer of the network) provides an estimation of some target value that is attempted to being modelled by the network. Neural networks were first developed the in the 1940s [23], however have seen a large increase in usage in recent years, due to increased computational power of modern computers. This allows larger, hence more powerful networks to be trained within a reasonable time frame.
3.1 MultiLayerPerceptron
Inspired by the perceptron algorithm Hinton et al created a feedforward multilayer perceptron (MLP) [24]. MLPs consist of distinct layers of nodes, where each node’s value is a summation of the previous layer $l1$ with the synaptic weights of the current layer ${w}_{ij}^{l}$ under a nonlinear activation function $g$. MLPs can approximate any function with an infinitely sized hidden layer [25] and can be trained to minimise a given loss function via the backpropagation algorithm [26]. Nonlinear activation functions are a key process of the MLP’s ability to approximate any function and they are required to be differentiable in order to update the weights via backpropagation. The value of a node of a hidden layer $l$ is given by:
$${h}_{i}^{l}=g\left(\sum _{j}{w}_{ij}^{l}{h}_{j}^{l1}\right)=g\left({W}^{l}{h}^{l1}\right)\mathit{\hspace{1em}\hspace{1em}}\text{where}W\in {\mathbb{R}}^{n\times m}$$  (3.1) 
A loss function is defined by the final output layer $\widehat{y}$ and the target value $y$ given by the model (weights) $\theta $, $\mathcal{L}(\widehat{y},y\theta )$. As the output layer is a composite function the weights can be updated by differentiating the weights w.r.t the loss function using the chain rule.
$$\frac{\partial \mathcal{L}}{\partial {w}_{ij}^{l}}=\frac{\partial \mathcal{L}}{\partial {z}_{i}^{l}}\frac{\partial {z}_{i}^{l}}{\partial {w}_{ij}^{l}}=\frac{\partial \mathcal{L}}{\partial {h}_{i}^{l}}\frac{\partial {h}_{i}^{l}}{\partial {z}_{i}^{l}}\frac{\partial {z}_{i}^{l}}{\partial {w}_{ij}^{l}}$$  (3.2) 
where $z={\sum}_{j}{w}_{ij}^{l}{h}_{j}^{l1}$. As the loss w.r.t the ${h}^{l}$ is dependent on the next layer $\frac{\partial \mathcal{L}}{\partial {h}_{i}^{l}}$ can be expressed recursively:
$\frac{\partial \mathcal{L}}{\partial {w}_{ij}^{l}}$  $=\left({\displaystyle \sum _{j}}{\displaystyle \frac{\partial \mathcal{L}}{\partial {z}_{j}^{l+1}}}{\displaystyle \frac{\partial {z}_{j}^{l+1}}{\partial {h}_{i}^{l}}}\right){\displaystyle \frac{\partial {h}_{i}^{l}}{\partial {z}_{j}^{l}}}{\displaystyle \frac{\partial {z}_{i}^{l}}{\partial {w}_{ij}^{l}}}$  $\text{Since}{\displaystyle \frac{\partial {z}_{i}^{l}}{\partial {w}_{ij}^{l}}}={h}_{i}^{l1}\text{and}{\displaystyle \frac{\partial {h}_{i}^{l}}{\partial {z}_{j}^{l}}}={\displaystyle \frac{\partial g({z}_{i}^{l})}{\partial {z}_{j}^{l}}}$  
$=\left({\displaystyle \sum _{j}}{\displaystyle \frac{\partial \mathcal{L}}{\partial {z}_{j}^{l+1}}}{\displaystyle \frac{\partial {z}_{j}^{l+1}}{\partial {h}_{i}^{l}}}\right){\displaystyle \frac{\partial g({z}_{i}^{l})}{\partial {z}_{j}^{l}}}{h}_{i}^{l1}$  (3.3) 
3.2 Convolutional Neural Networks
Convolutional neural networks (CNN) are a visual cortex inspired neural network similar to the MLP as described above. They differ by the replacing the linear summation by a convolutional operation across the input. The convolution operation can be physically interpreted as an area overlap between two functions [27], in the context of neural networks it can be loosely interpreted as comparing a signal (image in video games) to a set of learnt filters which have a feature representation of different components of a signal. Using a convolutional layer allows the network to learn spatially or temporally invariant features across an input signal as the same filter is used across different sections of the input. In the context of images, this is equivalent to learning specific features, for example the features of a car regardless of the location of the car within the input. The convolution operation is defined by,
$f(t)*g(t)={\displaystyle {\int}_{\mathrm{\infty}}^{\mathrm{\infty}}}f(\tau )g(t\tau )\mathit{d}\tau $  For continuous functions  (3.4)  
$x[m]*y[m]={\displaystyle \sum _{n=\mathrm{\infty}}^{\mathrm{\infty}}}x[n]y[mn]$  For discrete functions 
The hidden layer component ${h}_{ij}^{l}$ for a 3D fully convolutional layer with a single filter is given by,
${h}_{ij}^{l}=g\left({w}_{ijk}^{l}*{h}_{ij}^{l1}\right)=g\left({\displaystyle \sum _{x}}{\displaystyle \sum _{y}}{\displaystyle \sum _{z}}{w}_{x,y,z}^{l}{h}_{ix,jy,kz}^{l1}\right)$  $\text{for a singular}k\times k\times c\text{filter}{w}_{ijk}^{l}.$  (3.5) 
3.3 Recurrent Neural Networks
CNN and MLP are both examples of feedforward neural network architectures. Specifically, these are both discrete acyclic graphs, meaning that each node in the graph is not connected to itself via any path. This reduces the capability of these networks for modelling sequential data, when the features $x\in {\mathbb{R}}^{n}$ are fed into the network at each time step $t$. This is because, in acyclic graphs there in no modelled relation between each pass of the network. Recurrent Neural Networks (RNN), solve this problem by combining the hidden state ${h}_{t}$ of the feature ${x}_{t}$, with the hidden state ${h}_{t1}$ of the previous feature ${x}_{t1}$. The hidden layer at time $t$ for a singlehidden layer RNN is given by the equation,
$${h}_{t}=g\left({W}_{x}{x}_{t}+{W}_{h}{h}_{t1}\right)\mathit{\hspace{1em}\hspace{1em}}\text{where}{W}_{x}\in {\mathbb{R}}^{m\times n}\text{,}{W}_{h}\in {\mathbb{R}}^{m\times m}.$$  (3.6) 
The output ${y}_{t}$ is defined by an additional MLP layer, and can be produced every timestep in a manytomany configuration or at the final timestep $T$ in a manytoone configuration. Another configuration includes that of onetomany where the output ${y}_{t1}$ is fed into 3.6 instead of the previous hidden state ${h}_{t1}$.
As the hidden state ${h}_{t}$ hence output ${y}_{t}$ is conditioned of the previous step, the backpropagation algorithms flows backwards through the time steps $T\to {t}_{0}$, hence is known as backpropagation through time. However the drawbacks of this simple recurrent network is that of the vanishing gradients problem, where the gradients decrease in size from $T$ to ${t}_{0}$, hence simple RNNs can struggle to learn from long sequential data. This problem is reduced by the introduction of Long Term Short Memory (LSTM) recurrent networks which use multiple gates, in order to learn to extract more useful longterm important features.
3.4 Activation Functions
Many different nonlinear activation functions are used for artificial neural networks and bring different strengths and weakness depending on the usecase. A brief description of each activation along with the relevant formula is given below.
The hyperbolic and logistic activation function families have been a popular choice throughout the history of neural networks. The two most common choices however are the tanh and sigmoid functions, these are given by the equations,
$g(x)=tanh(x)={\displaystyle \frac{\left({e}^{x}{e}^{x}\right)}{\left({e}^{x}+{e}^{x}\right)}}$  tanh  (3.7)  
$g(x)=\sigma (x)={\displaystyle \frac{1}{1+{e}^{x}}}$  $\text{sigmoid}.$  (3.8) 
The tanh and sigmoid function constrain the value of $x$ to within the boundaries $[0,1]$, $[1,1]$ respectively. This can lead to vanishing gradients [28] for deeper neural networks when training with backpropagation. This is because the gradient is restricted to between the values of $[0,1]$, since backpropagation uses the chain rule, these gradients can quickly tend towards zero resulting in very small weight updates, as well as numerical error effects dominating the weight updates.
Rectified linear unit (ReLU) is perhaps the most popular activation function due to its simplicity, speed and often performance increase. It is given by the equation,
$$g(x)=\{\begin{array}{cc}0\hfill & \text{for}x\le 0\hfill \\ x\hfill & \text{for}x0\hfill \end{array}\mathit{\hspace{1em}\hspace{1em}}\text{ReLU.}$$  (3.9) 
The performance increase has often been attributed to the increase in sparsity of the connections that carry any nonzero value, this has been observed to be between 50%80% in [29] for deep neural networks. Sparsity of neural connectivity is seen as a more biologically plausible model, as it is estimated that neurons in the human brain have a sparsity of between 90%99% [29]. This intuitively makes sense as for random weight initialisation with mean zero should produce 50% sparsity, as the sets of positive and negative real numbers are equal in size. The vanishing gradient problem is negated as the gradient of the ReLU function is either 1 for positive $x$ or zero otherwise
$$\frac{\partial g(x)}{\partial x}=\{\begin{array}{cc}0\hfill & \text{for}x\le 0\hfill \\ 1\hfill & \text{for}x0\hfill \end{array}$$ 
, hence the gradients do not reduce in size over multiple layers. However the downside is that ReLU units can enter a value such that the neuron is never activated again, this is known as the dying ReLU problem and is solved by replacing the zero value in 3.9 by some function of $x$. For example it is replaced by $\alpha x$ in the leaky ReLU and $a({e}^{x}1)$ in the Exponential Linear Unit (ELU) functions.
$g(x)=\{\begin{array}{cc}\alpha x\hfill & \text{for}x\le 0\hfill \\ x\hfill & \text{for}x0\hfill \end{array}$  leaky ReLU.  (3.10)  
$g(x)=\{\begin{array}{cc}a({e}^{x}1)\hfill & \text{for}x\le 0\hfill \\ x\hfill & \text{for}x0\hfill \end{array}$  ELU.  (3.11) 
The softmax activation function produces a normalised probability distribution over each feature in an input vector $x\in {\mathbb{R}}^{d}\to z\in {\mathbb{R}}^{d}$, such that the probability of ${z}_{i}$ is is proportional to the exponent of the input value ${x}_{i}$ [9]. This is explicitly calculated by the formula,
$g({x}_{i})={\displaystyle \frac{{e}^{{x}_{i}}}{\sum _{k\in d}{e}^{{x}_{k}}}}$  $\text{softmax}.$  (3.12) 
As the nature of this activation function this is generally used on the final layer of the network for multinomial logistic regression.
3.5 Loss Functions
In order to calculate the gradients a distance metric between the desired output $y\in {\mathbb{R}}^{N}$ and the actual output $\widehat{y}\in {\mathbb{R}}^{N}$ must be defined. Mean squared error is a common distance metric which penalises the difference between two vectors by the mean of the squared distance between each feature
$$MSE={y\widehat{y}}^{2}=\frac{1}{N}\sum _{i=0}^{N}{({y}_{i}\widehat{{y}_{i}})}^{2}.$$  (3.13) 
A better comparison between two probability distributions than the mean squared error of the distributions is the cross entropy. Entropy is a measurement of uncertainty in a probability distribution. Entropy is highest when a distribution is uniform and lowest as the distribution approaches the Dirac delta function. For example the uncertainty is high for the outcome of a fair sided die roll, hence it has high entropy. For a discrete probability distribution entropy is given by,
$$H(x)=\sum _{i}p({x}_{i})\mathrm{log}p({x}_{i}).$$  (3.14) 
Information theory states that a message contains high information when the outcome is unpredictable [30], as you have gained information about the outcome. Information is lowest when the outcome is known and predictable, as you already had the information about what the outcome would be. Entropy then can be seen as the amount of information in a signal. The KullbackLeibler (KL) divergence can be seen as the amount of information needed to encode the information in the probability distribution $p$ with another distribution $q$ [31], it is lowest when $q=p$ and hence can be seen as an error between these two probability distributions. Formally this is the expected log difference between the two distributions,
$${D}_{KL}(pq)={\mathbb{E}}_{x}[\mathrm{log}p\mathrm{log}q]=\sum _{i}p({x}_{i})\mathrm{log}p({x}_{i})p({x}_{i})\mathrm{log}q({x}_{i})=\sum _{i}p({x}_{i})\mathrm{log}\left(\frac{p({x}_{i})}{q({x}_{i})}\right).$$  (3.15) 
However this is not a true distance metric because it is not symmetrical ${D}_{KL}(pq)\ne {D}_{KL}(qp)$, nor does it satisfy the triangle equality [32].
The second term in equation 3.15, ${\sum}_{i}p({x}_{i})\mathrm{log}q({x}_{i})$ is also known as the cross entropy between $p$ and $q$. For a fixed $p$, for example the labels for a classification task, minimising the cross entropy is equivalent to minimising the KLdivergence. Minimising both the mean squared error and cross entropy are equivalent to maximising the likelihood of the data given the parameters [9].
3.6 Optimisation Methods
As backpropagation is a gradient based optimisation process, that minimises the parameters in some loss space $L$. Using gradient based optimisation allows the parameters to approach at least a local minimum for convex problems as well as stationary points for nonconvex problems [33]. As neural networks require very large amount of data points compared to other Machine Learning algorithms, updating the gradient based on the average over all data points in a data set is impracticable. Thus, Stochastic Gradient Descent (SGD) methods are often used to increase training speed.
Using SGD methods is especially important as all data points cannot be accessed at any time, as it is being continually generated by the agent. SGD, however requires heavy tuning in order to find a optimal learning rate $\eta $ for different tasks, such that the parameters do not get stuck in local minima and preferable one that converges with as few updates as needed. Because of this modified SGD algorithms are often used throughout machine learning, especially with neural networks.
3.6.1 Momentum based SGD
In order to reduce stochasticity of the SGD updates, the concept of momentum of gradient updates was first used in [34]. Here the analogy is that of a particle rolling down a hill, it experiences a force pulling on it which increases its momentum. When the gradients are steeper the particle accelerates, and as the hill flattens it decelerates. This stabilises the direction of the gradient updates as the momentum makes it less susceptible to large deviations caused by a single minibatch update. If the optimiser has a large enough momentum it can cause it to escape suboptimal local minima. However, too large momentum can cause the optimiser to overshoot minima thus being slower to optimally converge, as seen in Figure 3.1.
3.6.2 RMSProp
Root Mean Squared Propagation (RMSProp) [35], adapts the magnitude of the update for each parameter by dividing by a running average of the root mean squared gradient $v={\overline{W}}^{2}$, this allows different features to be learnt at different scales, thus reducing the need to finely tune the learning rate. For a neural network layer weighting $W$, the update rule can be written as
${v}_{t}\leftarrow \beta {v}_{t1}+\left(1\beta \right){\left({\displaystyle \frac{\partial J}{\partial {W}_{t}}}\right)}^{2}$  (3.16)  
${W}_{t+1}\leftarrow {W}_{t}{\displaystyle \frac{\eta}{\sqrt{{v}_{t}}}}{\left({\displaystyle \frac{\partial J}{\partial {W}_{t}}}\right)}^{2}.$ 
Where $\beta $ is the decay rate of the average.
3.6.3 Adam
Adaptive Moment Estimation (Adam) [36] combines the adaptive magnitude updates from RMSProp 3.16 with momentum based gradient descent [36], with bias correction for momentum and adaptive moment (mean square of the gradient)
${m}_{t}\leftarrow {\beta}_{m}{m}_{t1}+\left(1{\beta}_{m}\right)\left({\displaystyle \frac{\partial J}{\partial {W}_{t}}}\right)\mathit{\hspace{1em}\hspace{1em}}\text{momentum}$  (3.17)  
${v}_{t}\leftarrow {\beta}_{v}{v}_{t1}+\left(1{\beta}_{v}\right){\left({\displaystyle \frac{\partial J}{\partial {W}_{t}}}\right)}^{2}\mathit{\hspace{1em}\hspace{1em}}\text{adaptive moment}$  
${\widehat{m}}_{t}\leftarrow {m}_{t}/\left(1{\beta}_{m}^{t}\right)\mathit{\hspace{1em}\hspace{1em}}\text{bias correction for momentum}$  
${\widehat{v}}_{t}\leftarrow {v}_{t}/\left(1{\beta}_{v}^{t}\right)\mathit{\hspace{1em}\hspace{1em}}\text{bias correction for adaptive moment}$  
${W}_{t+1}\leftarrow {W}_{t}\eta {\displaystyle \frac{{\widehat{m}}_{t}}{\sqrt{{\widehat{v}}_{t}}+\u03f5}}$ 
Where ${\beta}_{m}^{t},{\beta}_{v}^{t}$ are the decay rates with exponent of the number of updates $t$, and $\u03f5$ is a small scalar value to prevent division by zero [36].
Chapter 4 Deep Reinforcement Learning
In video games it is common for the set of states $\mathcal{S}$ to be given as a pixel array of the game screen $s\in {\mathbb{R}}^{h\times w\times c}$, for a single colour frame of size $84\times 84$ there are ${84}^{2}\times {(255)}^{3}\approx 1\times {10}^{11}$ different image combinations. Most video games will likely only use a tiny fraction of this image space, but it is still impractical to model the value functions for each state separately. A solution is to use a differentiable function to model $V(s)$, this is known as value function approximation [12]. Deep Reinforcement Learning refers to using a deep neural network as a nonlinear value function or policy approximator.
4.1 Traditional Deep Reinforcement Learning
4.1.1 Deep Q Learning
The Deep Q learning algorithm proposed by [6] is essentially the Q learning algorithm described previously (2.17), with two additional key features that stabilise the policy iterations for the use of neural networks. The two additional features are experience replay, and fixed target policy iteration. If using a gradient based iterative optimisation process as done via backpropagation in neural networks, it is important that the data is independent and identically distributed (i.i.d.). This is done as to avoid sampling bias from correlated inputs, which can cause the gradient to get stuck in a nonoptimal local maxima (as we are performing gradient ascent to maximise the expected future reward). Experience replay is a method to help decorrelate the sequential experiences gained from dynamic programming and model free reinforcement learning methods. This is done by storing experiences, a tuple of $$ into a list of experiences known as the replay memory. Batch samples can be drawn randomly from the replay memory which provide $\sim $ i.i.d. for a large replay length. Deep Q Learning also solves the partially observable MDP problem by providing the input of the neural network with 4 concatenated previous states (greyscaled images for CNN).
The Deep Q Learning algorithm is expressed below :
4.1.2 Advantage Actor Critic
Asynchronous Advantage ActorCritic (A3C) [37] is an algorithm where multiple actors interact and learn from their own environment asynchronously. Each actor has it’s own local policy parameters which it uses to calculate the gradients of the policy and value function over nsteps, this is done using TD($n$). The gradients calculated by each actor are asynchronously applied to shared global parameters, and after each $n$step rollout the local parameters are updated from the global parameters. Running multiple environments allows the agent to learn from many uncorrelated experiences, thus reducing the impact onpolicy training has on local optimal convergence. Thus, is seen as a replacement for an experience replay buffer. The greatest benefit of the A3C and other asynchronous method is the reduced training time over single actor algorithms, especially when using distributed computing [38].
A2C [39] is a synchronous version of A3C, where multiple actors interact synchronously using a single shared policynetwork. Running these environments synchronously removes the need to correct for any offpolicy correction and allows for greater GPU utilisation.
4.1.3 Synchronous DDQN
In [37] the asynchronous multiactor framework was also extended to include versions which use stateaction value functions, e.g. SARSA and QLearning. Synchronous $n$step Double DQN (Sync DDQN), is a synchronous version of the async DQN algorithm proposed in [37], with Double DQN parametrisation following [41]. The network used to provide the value of action was a fixed copy of the online network $\theta $. The online network was used to select the action ${a}_{t}$ under the epsilon greedy policy. The value used to bootstrap the estimated future return from 4.2 is then replaced by $Q({s}_{t},\mathrm{arg}{\mathrm{max}}_{a}Q({s}_{t},a{\theta}_{i}){\theta}_{i1})$, where ${\theta}_{i1}$ is a periodic copy of $\theta $ updated every $C$ steps similar to 4.1.
4.1.4 Proximal Policy Optimisation
Trust Regions
In reinforcement learning the training data is dynamically produced by the agent. This means too large policy update can significantly alter the behaviour of the agent. Too large an update can cause the agent to move into parameter space where it converges on suboptimal behaviour or worse, it no longer receives any rewards from the environment. A safer method to update the policy is to stick within a trusted region of the current policy as to not cause too dramatic updates. In Trust Region Policy Optimisation TPRO, [42] the surrogate objective function for a specific state, action pair ${s}_{t},{a}_{t}$ is the policy scaled by the policy of this pair w.r.t. the old policy parameters ${\theta}_{old}$, penalised by the KLdivergence between the old policy and the new policy. This is given by the equation
$$\underset{\mathit{\theta}}{\mathrm{maximize}}{\mathbb{E}}_{t}[\frac{\pi \left({a}_{t}{s}_{t},\theta \right)}{\pi \left({a}_{t}{s}_{t},{\theta}_{old}\right)}{A}_{t}\beta \mathrm{KL}(\pi (\cdot {s}_{t},{\theta}_{old})\pi (\cdot {s}_{t},\theta ))]\text{[43]}.$$  (4.1) 
$\beta $ is a constant that controls the strength of the KLdivergence penalty. The original derived quantity leads to a too small an update [42], thus requires tuning for each task, or requires further complexity by adaptively changing the strength throughout training [43]. In [42] they overcome this by optimising the surrogate objective ${\mathbb{E}}_{t}\left[\frac{\pi \left({a}_{t}{s}_{t},\theta \right)}{\pi \left({a}_{t}{s}_{t},{\theta}_{old}\right)}{A}_{t}\right]$
under the constraint that the KLdivergence is less than some scalar value $\delta $, instead of being penalised by it. This requires second order optimisation methods introducing even further complexity.
PPO
Proximal Policy Optimisation PPO [43], is a first order optimisation method that is used to calculate a lowerbound for the policy update. This is done in the form of clipping the policy ratio. The minimum value between the unclipped surrogate policy objective, and the clipped surrogate policy objective is chosen to perform the update on $\theta $.
$$\underset{\mathit{\theta}}{\mathrm{maximize}}{\mathbb{E}}_{t}\left[min(\frac{\pi ({a}_{t}{s}_{t},\theta )}{\pi ({a}_{t}{s}_{t},{\theta}_{old})}{A}_{t},clip(\frac{\pi ({a}_{t}{s}_{t},\theta )}{\pi ({a}_{t}{s}_{t},{\theta}_{old})},\mathrm{\hspace{0.17em}1}\u03f5,\mathrm{\hspace{0.17em}1}+\u03f5){A}_{t})\right]\text{[43]}.$$  (4.2) 
PPO is optimised over several epochs for each $n$step sample, no theoretical justification is given in [43] for this. This is a problem for policy gradient methods as after the first epoch the updated parameters are longer the parameters that produced the current training data, and thus further epochs are offpolicy updates. The use of multiple updates could be justified by claiming the parameter updates are small enough as they constrained by the old policy and the clipping bound, such that the new policy is never too far from the old policy. Thus, reducing the chances of suboptimal convergence.
As stated previously the Q Learning algorithm suffers from a maximisation bias and poor exploration. To improve on [6], methods such as $TD(n)$ and Double Q Learning as previously explained can be implemented along with additional reinforcement learning methods such as Duelling Q Networks, A3C and prioritised experience replay. These provide significant gains on the initial DQN algorithm as shown in [37] and [44]. Although, even after combining all these deep reinforcement learning methods, deep reinforcement learning still suffers from sampling inefficiency requiring up to 50 million state samples (200 million frames) in order to to converge on an policy equivalent to humanlevel performance in Atari2600 games. Furthermore, all of these improvements fail to achieve near human level performance in sparse extrinsic reward games such as MontezumaRevenge or PrivateEye. 200 million frames is equivalent to playing $\sim $ 38 days in realtime, playing at 60fps. As stated earlier this is extremely inefficient learning when contrasted against humans who are often able to perform well in both dense and sparse reward games after a single or few attempts.
4.2 Deep Reinforcement Solutions to Sparse Extrinsic Rewards
4.2.1 Curiosity Driven Exploration via NextState Prediction
Curiosity driven exploration in deep reinforcement learning as seen in [8] attempts to solve the lack of frequent rewards by introducing a intrinsic reward function. The intrinsic reward function provides rewards when the agent experiences novel states. In order to avoid the agent from seeking states of high stochasticity which the agent cannot model e.g. leaves fluttering in a breeze [8], the agent learns an exploration model only for features that the agent can interact with. The intrinsic reward is calculated by intrinsic curiosity model (ICM) in [8]. The ICM consists of two models the inverse dynamics model which tries to predict the action ${a}_{t}$ given an encoded version of states $\varphi ({s}_{t})$ and $\varphi ({s}_{t+1})$ and the forward model that tries to predict the next encoded state $\varphi ({s}_{t+1})$ using the actual action taken by the agent ${a}_{t}$ and the encoded state $\varphi ({s}_{t})$. In [15] the intrinsic reward ${i}_{t}$ is given by the mean squared error ${i}_{t}={\widehat{\varphi}({s}_{t})\varphi ({s}_{t+1})}^{2}$. By jointly optimising the inverse and forward model, the agent gains no benefit of trying to model any part of the environment that isn’t affected by the action taken [8]. However, this introduces a problem as parts of the environment that are not immediately affected by the agent’s policy, may be important in the future [15].
4.2.2 Unsupervised Auxiliary Tasks
The UNsupervised REinforcement and Auxiliary Learning (UNREAL) [45] agent increases the sample efficiency of the A3C agent by extracting additional reward signals from the environment.
One additional task is that of pixel control (PC), where an auxiliary policy ${Q}^{aux}$ is trained to maximise the pixel intensity across different sections of the image input over an $n$step period as seen in Figure 4.4(b). The auxiliary stateaction function is trained by minimising the meansquared error of the between the predicted $n$step pixel change given as
$${L}_{{Q}^{aux}}=\mathbb{E}\left[{\left({R}_{t}^{n}+{\gamma}^{n}\underset{{a}^{\prime}}{\mathrm{max}}{Q}^{aux}({s}^{\prime},{a}^{\prime}{\theta}_{i1}){Q}^{aux}(s,a{\theta}_{i})\right)}^{2}\right]\text{[45]}.$$  (4.3) 
The other auxiliary tasks are that of reward prediction (RP), where the agent tries to predict the reward received only for the next step given three preceding frames Figure 4.4(c), and finally the value function replay (VR) which further trains the value function $V({s}_{t})$ from a replay buffer 4.4(d). All three auxiliary tasks are trained offpolicy where input state sequences are sampled from a small experience replay buffer. As the sequences sampled have different starting points to the online rollouts, this allows the agent to learn new temporal relations between states [45]. Here the three auxiliary tasks use the agents’ existing neural network to jointly optimise the network for all tasks. The total loss function is then,
$$L={L}_{A3C}+{\lambda}_{PC}{L}_{{Q}^{aux}}+{\lambda}_{VR}{L}_{VR}+{\lambda}_{RP}{L}_{RP}$$  (4.4) 
The auxiliary tasks do not directly affect the policy $\pi $, but rather shape the features in the encoder network which could then affect the agent’s behaviour. However, Unlike the curiosity driven exploration the pixel control task learns to maximise the expected change in pixels, this could possibly indirectly lead to the agent seeking states of high stochasticity if ${L}_{{Q}^{aux}}$ is too large, as it will bias the encoded features towards those that change the pixel intensity the most.
4.2.3 Random Network Distillation
The novelty of a state can be also measured by the number of counts the agent has been in that state, in [46] this is combined with intrinsic motivation by providing an additional reward signal which is inversely proportional to the count. This method was simplified in [14] via using a Random Network Distillation (RND). RND consists of a base policy ${\pi}_{\theta}$, as well as randomly initialised target and predictor networks $f$ and $\widehat{f}$ respectively. The target network is fixed after initialisation and the predictor network is trained to predict the output of the target network for the next state ${s}_{t+1}$. The intrinsic reward is defined as the mean squared error between the two outputs ${i}_{t}={\widehat{f}({s}_{t+1})f({s}_{t+1})}^{2}$ and the predictor network is trained by minimising this loss [14]. By using a lower dimensional representation this should reduce the effect of the agent seeking states of high stochasticity as mentioned earlier.
Minimising the distance between $f$ and $\widehat{f}$ implicitly learns the count of the number times a state has been visited, as the predictor network cannot predict the target network output with greater than random probability without first visiting the state. As the target network $f$ output is deterministic i.e. constant, this means the output is learnable and the loss can approach at least a local minima through gradient descent. This is notably different to a forward dynamics model, where for certain nonstochastic environments such as simplistic physics based environments the next state can be predicted without necessarily having to have explicitly visited that state, as the forward dynamics model learns the true transition model ${\mathcal{P}}_{s{s}^{\prime}}^{a}$. This is likened to the agent getting bored in [8], as the outcome becomes more predictable hence less novel.
4.2.4 Hindsight Experience Replay
Auxiliary Tasks and curiosity driven exploration both attempt to explore unknown states by maximising the statistical novelty of a state compared to states the agent previously experienced. This only will work if there is a variation amongst the states being explored, however in some tasks the states lack diversity hence the ICM and pixel control policies will not sufficiently be able to explore the state space [47]. This is especially crucial for large state spaces with very few rewards. The proposed solution uses every episode the agent performs to provide a reward even if the agent does not achieve the said goal. This is done by creating virtual goals which use the final the state the agent ended up in as the desired outcome. An example given in [47] is using a robotic arm to moving an object to a specific coordinate (x,y). For a large x,y plane randomly sampling a specific location would be statistically negligible, so after every episode the agent updates it’s value function by ”pretending” that the final (x,y) coordinate was the desired outcome. Hence, the agent learns to move the object to any specified coordinate.
Chapter 5 Evaluation and Testing
5.1 Environments
In order to test the efficacy of the algorithms discussed above, a suitable MDP environment must be used. Several environments were used to provide a variety of environment complexities and task difficulties, but crucially a difference in sparsity of the extrinsic reward received. The environments include that of classic control games {CartPole, Acrobot, MountainCar} and a small selection of Atari games {SpaceInvaders, Freeway, MontezumaRevenge} were chosen to meet the above requirements, as well as their popularity so that there were plenty of reference results in the literature. All environments were executed using OpenAI’s gym toolkit.
5.1.1 Classic Control
These simplistic environments represent the state of the environment with a vector $s\in {\mathbb{R}}^{n}$, with each feature corresponding to a physical variable of the system that the agent needs to learn to control in order to gain rewards. The CartPolev1 game consists of an unstable inverted pendulum (pole) attached to a cart and the goal is to keep the pole upright. The agent can move the cart left or right via applying a discrete force of $\pm 1$ along the horizontal axis of the cart. The game finishes when the pole moves greater than 15°either side of the upright position, moving $\pm 2.4$ units from the centre or after 500 steps. The agent receives a reward +1 for each step until the episode is finished [48].
The Acrobotv1 environment consists of a double pendulum in which the goal is to try to swing the tip of the lower pendulum to a specific target height. The agent can move the pendulum by applying a torque of {+1, 0, 1} to the joining link. Every step the agent receives a negative reward other than the target state [49].
The final control task is MountainCarv0, where the goal is to reach the top of the mountain from the bottom of the valley. However, the car does not have enough power to directly drive up the side, therefore in order to solve the problem, the agent must gain momentum from oscillating either up slide of the mountain. The agent receives a 1 reward for every step the agent is not at the top and the episode terminates at 200 steps. Therefore the agent will receive a 200 reward unless it learns to reach the top in under 200 moves [50].
At first glance these as all these tasks contain a dense extrinsic reward signal as they produce a non neutral at each step, so one might expect that they should be solved easily. However, under further inspection in Acrobot and MountainCar tasks, the rewards contain very little information about the stateaction value, as all actions are considered equally bad regardless of how close the agent is to the optimal behaviour. An agent must sufficiently explore the state space in order to achieve the a higher reward, hence learn from it.
5.1.2 Atari 2600
For the Atari 2600 games, OpenAI provides wrapper to the Arcade Learning Environment (ALE). The ALE produces a 210x160x3 RGB image for each frame, however many frames are duplicates or are very similar so it has been a common practise since [6] to only consider the nth successive frames, repeating the same action n times. This helps reduce the computational cost by concatenating n steps in the MDP onto a single step. Another preprocessing step performed is to convert the images to greyscale and scale and crop the image to a size 84x84. The final preprocessing step for the Atari environments is to concatenate the current frame with the previous k frames into a single image. This solves the partially observable MDP problem, providing additional temporal information into the state (such as velocity of the ball in pong) that would not be present in a static frame. This allows the use of convolutional neural networks to be used as the nonlinear statevalue function approximator.
As the environment is deterministic a random amount of ‘noop’ (neutral) actions were taken at the start of each episode, this changes the initial starting state, and is done to prevent the agent ‘memorising’ good actions from frequently visited target states. Sometimes an loss of life, done or terminal flag is used to increase the agent’s learning efficiency and, performance. This means when bootstrapping for the estimate of the expected future return, a loss of life is treated as the terminal state. The reasoning being that using the flag will allow the agent to quickly learn losing a life is to be avoided. Another commonly used practise is to cap the maximum number of steps an agent can take in the environment before the episode terminates. This penalises idle behaviour, thus should make the agent more efficient. The exact environment parameters are given in 7.1
In SpaceInvaders a series of waves of aliens zigzag down the screen firing at the agent, the agent’s goal is to shoot the aliens before they reach a certain threshold height. The game increases in difficulty as the aliens increase their speed of descent as their numbers go down. a life is lost when the aliens reach the threshold and the game is ended when all 3 lives are lost or the player has defeated all waves of aliens.
In the game of Freeway the goal is for the player to reach the other side of the road without being hit by the oncoming traffic. The player receives a reward of +1 when reaching the other side and 0 otherwise.
In the game MontezumaRevenge the goal is to explore temple rooms and collect amulets. Navigating the different rooms requires the collecting objects such as keys found throughout the level, as well as stringing together a series of complex movements to avoid falling and traps.
The games were chosen for varying sparsity of rewards as according to [46], SpaceInvaders provide a humanoptimal environment with relatively dense reward feedback and Freeway and MontezumaRevenge provide the sparse reward environments with easy and difficult search respectively.
Chapter 6 Implementation and Analysis
6.1 Software and Hardware
Python was the language of choice for implementing these algorithms as it is widely used in the reinforcement learning community, such as OpenAI’s baselines [40] allowing for better troubleshooting and debugging. Also, Python is one the most common used API for neural network packages, such as the popular Tensorflow and Pytorch packages. All neural networks were implemented using Tensorflow’s python API. Tensorflow is a Machine Learning framework that allows users to build graphical models using ‘tensors’, Tensorflow’s internal representation of a generalised matrix ^{1}^{1} 1 Note that these tensors are distinct in notation and properties from the tensors in tensor calculus [51], and provides builtin automatic differentiation, which automatically handles gradient calculation and backpropagation. And all the experiments were ran on either a single machine with a 8core, 16thread AMD Ryzen^{TM} 1800x with a single NVIDIA^{®} GeForce^{®} RTX 2070 GPU and 48GB of DDR4 RAM, or, on a single GPUnode on the Sheffield Advanced Research Computer (ShARC) high performance computing system, which contained a NVIDIA^{®} Tesla K80 with a Intel^{®} Xeon^{TM} E52630 v3 CPU. All references to computational speed and timing are those observed using the single machine setup.
6.2 Scalable Reinforcement Learning
In order to perform the Atari experiments over 200 million frames, it is crucial that the implementation use multiple workers to run the environments, these methods scale number of steps taken in the environment per second sub linearly. Many different scalable frameworks were attempted such as A3C and IMPALA [38], however due to Tensorflow’s incompatibility with python’s multiprocessing library these methods required multithreading, or using Tensorflow’s distributed training framework. Due python’s Global Interpretable Lock, multithreading is not truly parallel, only concurrent and attempted implementations using Tensorflow’s distributed training for a single machine were relatively slow compared to results shown in the literature [38]. The A2C framework, however was observed to increase the number of steps per second by a factor of 10, from $\sim $ 250 to $\sim $ 23002700 steps per second. The A2C framework reduces the need or multiple copies of the network by running a single step in multiple environment in parallel. The implementation of running multiple environments synchronously in parallel was adapted from OpenAI’s implementation [40], and achieved this by running an individual environment in a separate process using python’s multiprocessing library. Having the environments run synchronously reduces the overhead of passing gradients between processes as well as the need to account for offpolicy corrections. However, the most crucial benefit is the increased efficiency of the use GPU when using a single machine. This is as value estimations can be ran in a single batch decreasing the time spent of transferring data onto the GPU, which is a large bottleneck for GPU applications.
6.3 Encoder Neural Networks
All implementations used neural networks to encode the state ${s}_{t}$ onto a high dimensional feature vector, which then would be used to predict the value function of policy distribution from. In order to be consistent across experiments, every agent used the same encoder networks architectures.
For the Classic Control tasks the encoder network was two MLP layers of with 64 hidden units each, under the ReLU activation function.
For the Atari experiments a CNN encoder network was used, as in [13], as it is a commonly used to benchmark to test different algorithms on. This CNN is parametrised as following, first a 2D convolutional layer with filter size of $[8,8]$ of with 32 output channels and stride $[4,4]$, following a 2D convolutional layer with filter size $[4,4]$, 64 output channels and stride $[2,2]$. A final convolutional layer with 64 output channels, filter size $[3,3]$ and stride $[1,1]$. The resulting convolutional layer was flattened to a vector and a final fully connected MLP layer with 512 hidden units. All layers were under the ReLU activation function, and all convolutional layers had the ’VALID’ padding. The observational input range for the policy CNN encoder is scaled from $[0,255]$ to $[0,1]$ for each pixel by diving by 255 to avoid to large ReLU activations as it is unbounded in the positive direction equation 3.9.
Both encoder networks use the Glorot uniform weight initialisation and zero bias initialisation.
6.4 A2C
For the AdvantageActor Critic A2C network, The actor and the critic share the same encoder network with the value function and unnormalised log probabilities (logits) being linear projections (via a fully connected MLP layer with no activation) from the encoded feature vector. To get the policy distribution $\pi $, the softmax activation function was applied to the logits, resulting in a policy distribution over the discrete set of available actions. The gradient of the entropy over the policy distribution ${\beta}_{H}{\nabla}_{\theta}H(\pi (s,\theta ))$ is added to the loss function following [37] in order to provide regularisation of the policy such that is doesn’t converge on a sub optimal policy. By maximising the entropy, the uniformity of the probability distribution increases, this should increases exploration as previously unlikely actions are sampled with greater frequency. The strength of the entropy maximisation is controlled by the parameter ${\beta}_{H}$ [37].
6.5 Synchronous DDQN
For the Synchronous $n$step DDQN network, the stateaction value function was a linear projection from the encoder network. Here double Qlearning was used to reduce maximisation bias.
6.6 PPO
The PPO implementation uses the exact same model as the A2C model mentioned above. A separate network for the old policy parameters ${\theta}_{old}$ is not used, instead the policies for each rollout are stored and then passed into the network during backpropagation.
6.7 Unsupervised Auxiliary tasks
6.7.1 Original Implementation
A3C Policy
In [45] the A3C agent is parametrised by an encoded CNNLSTM network, which takes in an RGB colour image of size $[84,84]$ as it’s input state ${s}_{t}$ for time t. It then encodes this onto a feature vector using a small two layer CNN following the original DQN paper implementation [6]. The CNNencoded state is concatenated with the previous reward and action is then fed into a LSTM layer with 256 hidden units, from this the policy and value function are projected as described in the A2C implementation.
Pixel Control
For The pixel reward for each time step the image at time $t$ is cropped to a the central region of size $80\times 80$ pixels, and subdivided into $20\times 20$ cells of size $4\times 4$. The pixel reward for time $t$ is then defined by the absolute difference between the mean over all colour channels and all pixels for each cell, of the input image ${s}_{t}$ and the previous image ${s}_{t1}$.
In order to produce the ${Q}^{aux}$ estimate of expected pixel change from each subdivided cell region, a MLP layer with relu is added the LSTM output layer, this is then reshaped to size $32\times 7\times 7$. A deconvolutional layer with 32 output channels, filter size $[3,3]$ and stride one produces a $32\times 9$ feature map. Two deconvolutional layers with output channels 1, number of actions are applied to the $9\times 9$ feature map concurrently, both with filter size $[2,2]$. This produces a value function and advantage function estimate which is used to produce the ${Q}^{aux}$ estimate via duelling neural networks given by [45]
$${Q}^{aux}(s,a\theta )=ReLU\left(V(s\theta )+A(s,a\theta )\frac{1}{\mathcal{A}}\sum _{{a}^{\prime}\in \mathcal{A}}A(s,{a}^{\prime}\theta )\right)\text{[52]}.$$  (6.1) 
Reward Prediction
For the reward prediction a reward ${r}_{t}$ is sampled from the experience replay in a skewed manner such that the probability mass for nonzero rewards is equal 0.5. The three proceeding images $[{s}_{t2},{s}_{t1},{s}_{t}]$ ^{2}^{2} 2 as ${r}_{t}$ is reward from ${s}_{t}\to {s}_{t+1}$ are passed into the CNN encoder, the flattened output features across all three images are concatenated into a single feature vector. This is then passed into a MLP layer with 128 hidden units with the ReLU activation function. This is then projected via another MLP layer onto a categorical distribution using the softmax function. The categories of this distribution are positive negative and neutral rewards, which should be easier to learn than the exact value of ${r}_{t}$, this gradient is calculated via cross entropy loss.
Value Replay
A small experience replay buffer stores the most recent 2000 states for each actor, the gradient of the value function replay task is calculated via mean squared error in the same fashion as the base A3C agent.
6.7.2 Custom A2CCNN implementation (UNREALA2C2)
A2C Policy
In order to be a more accurate comparison to the base A2C agent and for greater GPU utilisation, an CNNonly A2C version of [45] is implemented. In order to get this to work the following changes were made, first the states are converted to greyscale and stacked to the match base A2C and SyncDDQN implementations. In order to be more comparable to the A2C experiments the actionreward concatenation to the encoded state before the value and policy was not performed, as this provides extra temporal information that may improve performance.
Pixel Control
The pixel control reward at each time step is then altered to be defined as the absolute mean difference across the greyscale subdivided region of the most recent image ${x}_{t}$ in the concatenated state ${s}_{t}=[{x}_{t3},{x}_{t2},{x}_{t1},{x}_{t}]$
$${r}_{t}^{PC}=subdivided({x}_{t})subdivided({x}_{t1}).$$ 
Other changes, made to the pixel control task were to not perform the cropping operation and subdivide the image into $21\times 21$ cells. The only change this requires to the deconvolutional network is to increase the number of size of the MLP layer before the first feat map from size $32\times 7\times 7$ to $32\times 8\times 8$. Finally the greyscale images were normalised between the values of (0,1) using a simple minmax normalisation [53] in order to avoid manually fitting the ${\lambda}_{PC}$ term per environment. However, this does introduce a small nonstationarity affect into calculating the pixel rewards as the maximum pixel value of the dataset may increase over the course of training. This is less pronounced for than it would be for perchannel RGB min max normalisation, as well as being more stationary that perpixel zscore image normalisation. This is because in episodic video games tasks many start in a specific location, and when progressing to different areas often means a change in environment textures, thus possibly changing the normalisation parameters significantly. Whereas the change in maximum pixel intensity for a greyscaled images is expected to be negligible.
Reward Prediction
The reward prediction task is the same as the original implementation as described above but, without the frame concatenation as the state already provides necessary temporal information. Instead of sampling rewards from all workers, the worker with the highest reward is greedily chosen to reduce the chance of sparse rewards going unmissed, as well as increasing computational efficiency.
Because of the noticeable modifications to the original UNREAL agent’s algorithm, to distinguish between the two algorithms, the custom agent used is hence referred to as the UNREAL A2CCNN or UNREALA2C2 agent.
6.8 RND
6.8.1 Intrinsic Reward and Value Function
Following [14], The base policy is a PPO policy with an additional separate critic is used to calculate the value of the intrinsic rewards ${V}_{i}(s)$. The weighted combination of the extrinsic and intrinsic advantages is used to update the policy, and the discount factor for the extrinsic reward is increased to $0.999$ as to match [14]. The intrinsic rewards are nonepisodic, as [15] argues that this increases exploration, and they are normalised so that the scale is independent of the observational space [14]. This is achieved by dividing the intrinsic reward by the running estimate of the standard deviation of the intrinsic return. However the intrinsic rewards are nonepisodic so the true return cannot be calculated as it goes on indefinitely. It is also nonstationary as it is produced by a changing parametrised function, so using the intrinsic value estimate to approximate the return is suspected to be an unstable normalisation technique. So the normalisation used in [54] is used, where the intrinsic reward is normalised by the standard deviation of the inversely discount rewards as the past rewards are calculable. This can be explicitly defined as,
$${i}_{t}=\frac{{\leftf({s}_{t+1})\widehat{f}({s}_{t+1})\right}^{2}}{\sigma \left[{\displaystyle \sum _{k=0}^{t}}{\gamma}^{tk}{i}_{t}\right]}.$$ 
In practise the running estimate of the standard deviation is updated after every $n$step rollout and intrinsic return is calculated online.
6.8.2 Target and Predictor networks
As the target network is fixed, it cannot adjust to the scale of the observational inputs, thus normalisation techniques are required in order ensure the output scale is consistent across different environments [14]. The input to the predictor and target networks is the final image ${x}_{t}$ in the state ${s}_{t}=[{x}_{t3},{x}_{t2},{x}_{t1},{x}_{t}]$, this is then normalised by subtracting the running estimate of the perpixel mean and dividing it by the running estimate of the perpixel standard deviation and finally being clipped between the range 5, 5. However, This introduces nonstationarity to target network output thus further nonstationarity to the intrinsic reward. The running estimates of the observational mean and standard deviations are initialised by running a random agent for 50 rollouts for each actor and continually updated during training.
The target and predictor network are both separate models from the policy as to ensure bias free reward prediction, they both use the same convolutional encoder architecture as the policy network.
The target network has a final MLP layer of size 512 with no activation after the encoder. For the predictor network after the convolutional encoder it has two successive MLP layers of size 512, both with the ReLU activation followed by a final MLP layer with size 512 with no activation. The target and predictor networks use a leaky ReLU activations for the convolutional layers, as well as all layers being initialised with orthogonal initialisation with gain $\sqrt{2}$. This was not done to increase performance but to ensure the output of the target and prediction networks were of the same magnitude as the [54] implementation, so that the hyperparameters taken from [14] are valid.
6.9 ICM
For curious agents using the nextstate prediction, the ICM model is used to provide the additional reward for the PPO policy with separate intrinsic value function. The ICM is parametrised by additional separate encoder network using the same architecture and weight initialisation as the base policy encoder, but replacing the observational scaling with the observational normalisation as described in the RND implementation. The output from the encoder $\varphi ({s}_{t})$ is concatenated the with action ${a}_{t}$ and fed into the forward model. The forward model consists of one fully connected MLP layer with ReLU activation, with the number of hidden units the same size of the encoded state, i.e. 512 and 64 for the CNN and MLP encoder respectively. A following MLP with no activation and same size of the previous layer creates the predicted next state $\widehat{\varphi}({s}_{t+1})$ [8].
For the inverse model the encoded state and next state $\varphi ({s}_{t})$ and $\varphi ({s}_{t+1})$ are concatenated into a single feature vector and passed into a MLP layer with ReLU activation and size 512 and 64 for the CNN and MLP respectively. This is fed into a final MLP layer with no activation that maps onto the unnormalised logit probability distributions over the available actions producing $\widehat{{a}_{t}}$. The inverse model is trained as a classification task using cross entropy between the predicted action $\widehat{{a}_{t}}$ and actual action ${a}_{t}$ taken at time $t$, as well the forward model being trained via mean squared error ${\left\widehat{\varphi}({s}_{t+1})\varphi ({s}_{t+1})\right}^{2}$ [8]. Unlike the [15] implementation the forward model and the inverse model gradients are both passed back to the encoder model as in the original implementation [8], with weighting $\beta {L}_{forward}+(1\beta ){L}_{inverse}$, with $\beta =0.2$.
6.10 Random Network Distillation with Auxiliary Learning RANDAL
RAndom Network Distillation with Auxiliary Learning (RANDAL) is a novel method of combining curiosity driven exploration with RND and auxiliary tasks was implemented as following. A base PPO policy with RND is combined with all auxiliary tasks as described in UNREALA2C2 implementation, in order to stop the intrinsic reward from decreasing too rapidly and thus exploring less, a separate intrinsic value function is used as in the RND agent. This allows the value replay be performed exclusively on the extrinsic value function, preserving the rate of change in ${i}_{t}$. As the intrinsic reward is nonstationary, the reward prediction is done only on the sign ($+,,0$) of the extrinsic reward, this is done in hopes to increase stability as extrinsic rewards are stationary as well as being able to continue to use the simplified categorical reward rather than the exact reward value. The target and predictor networks are identical to the RND implementation.
6.11 Analysis
For the baseline architectures of SyncDDQN and A2C, it is expected that A2C agents shall outperform SyncDDQN agents as it is a policy gradient method. As previously mentioned policy gradient methods provide are theorised to have more stable convergence properties, thus should find a more optimal than the value based Sync DDQN agents. However, the A2C algorithm is expected to perform worse on sparse reward tasks as it suffers for poor exploration, e.g. A2C will probably never learn to navigate MountainCar, Freeway or MontezumaRevenge. Again this is because it is an onpolicy algorithm using the estimate of the optimal policy $\pi (as,\theta )$ to sample actions from each turn. The expectation of the Sync DDQN agents is that they should slightly underperform compared to the A2C for dense reward tasks but should perform better for tasks that benefit from random exploration, i.e. should outperform A2C on easy search sparse reward tasks, for the chosen environments this includes the Freeway, Acrobot and MountainCar tasks.
PPO agents should outperform A2C and SyncDDQN agents on all tasks, as the multiple epoch minibatch training should ensure sparse rewards do not get lost in the gradient update, as well as the value function estimate converging to the true value of the policy quicker than the single update algorithms. This reiteration of the value function is comparable to the value function replay in the UNREAL architecture, thus the performance of the PPO agent should be closer to the UNREAL agents than the A2C and SyncDDQN agents.
For the curious agents, the expectation is that the ICM and RND agents should perform equally well for high dimensional complex inputs, such as image observations in Atari games. This is because the forward model will not converge too quickly, providing similar amounts of rewards as the RND agents. For low dimensional noncomplex environments, ICM agents are expected the magnitude of the intrinsic rewards is expected to decrease quickly as the observations are easily predictable, even for unseen states.
Chapter 7 Results and Discussion
7.1 Testing and Validation
For the Atari experiments the number of frames skipped (actions repeated) was 4, meaning each environment step is equivalent to 4 frames. The number of frames stacked per state was 4, the experiments were ran over 50 million steps, or 200 million frames as according to [6] and the episode terminated after 4500 steps as to match [14]. The number of ‘noop’ actions was a random number between 0 and 7 steps, or 028 frames. The agent was validated every 1 million steps and the score was averaged over 50 episodes. During validation the episode was terminated after 10,000 steps approximately 11 minutes of realtime play to avoid games such as MontezumaRevenge carrying on indefinitely if the agent is stationary.
For the Classic Control experiments the ran over 2 million steps, the agent was validated following the same procedure as the Atari experiments except validated every 40,000 steps.
7.2 Atari Experiments
Due to the computational complexity and limited time of this project, no hyperparameter search was completed for the Atari domains, and the hyperparameters were taken from existing work in the literature. All agents use a gradients calculated with a clipping by normalisation value of 0.5 [55] to help stabilise gradient updates. All figures show the mean score over 3 random starts, $\pm $ the standard deviation of the runs.
7.2.1 Baselines
The selected environments were all run for both the Synchronous $n$step Double DQN and the A2C architectures as to provide a baseline to compare the more advanced algorithms to. As [43] and [14] uses Adam optimisation, in order for the baselines A2C and SyncDDQN to be a fairer comparison, Adam is used for both of these agents, with $\u03f5=1\times {10}^{8}$ and ${\beta}_{m}=0.9,{\beta}_{v}=0.999$. Although, RMSProp is used in the original implementation [37].
Hyperparameter
Value
Optimiser
Adam
Learning rate
$1\times {10}^{3}$
Number of actors
32
Entropy coefficient
${0.01}^{1}$
Value coefficient
${0.5}^{2}$
$n$step period
${5}^{1}$
Discount factor $\gamma $
${0.99}^{1}$
Gradient norm clipping
${0.5}^{2}$
Hyperparameter
Value
Optimiser
Adam
Learning rate
$1\times {10}^{3}$
Number of actors
32
Target Network Period
$10,000$ steps ${}^{3}$
Initial $\epsilon $
${1}^{1}$
Final $\epsilon $
${0.01}^{1}$
${\epsilon}_{test}$
${0.01}^{1}$
$n$step period
${5}^{1}$
Discount factor $\gamma $
${0.99}^{1}$
Gradient norm clipping
${0.5}^{2}$
In the Atari experiments for the value based Double DQN the exploration rate $\epsilon $ was linearly annealed over 2 million steps to provide sufficient exploration.
7.2.2 $\lambda $Return and PPO
In order to better connect the future rewards to the value of the current state $V({s}_{t})$, the $n$step period was increased to 20 to match the [45] implementation, and to avoid having to manually fit the $n$step period for each experiment the $\lambda $return from equation 2.16 and GAE from equation 2.27 were used for the valuebased and policy based methods respectively, with a value of $\lambda =0.95$ taken from [22]. For the PPO experiments, the $n$step period was further increased to 128 to match the value used in [15] and [14].
Hyperparameter  Value 

Optimiser  Adam${}^{1}$ 
Learning rate  $1\times {10}^{4}$ 
Number of actors  $32$ 
Entropy coefficient  ${0.01}^{1}$ 
Value coefficient  $0.5$ 
$n$step period  ${128}^{2}$ 
Number of epochs  ${4}^{2}$ 
Number of minibatches  ${4}^{2}$ 
Discount factor $\gamma $  ${0.99}^{1}$ 
PPO clip range  ${[0.9,1.1]}^{2}$ 
Gradient norm clipping  ${0.5}^{3}$ 
7.2.3 Sparse Reward Solutions
The three sparse reward solutions being tested are, the UNREALA2C2 agent, the RND with PPO, and the novel RANDAL with PPO. All agents use the respective $n$step period from their base policies in section 7.2.2 and all use GAE to improve performance. Due to the significantly slower algorithm implementation and time constraints of the project, a curiosity driven agent using the ICM was not tested for the Atari experiments.
Hyperparameter  Value 
Optimiser  Adam${}^{1}$ 
Learning rate  $1\times {10}^{4}$ 
Number of actors  $32$ 
Entropy coefficient  $0.001$ 
Value coefficient  $0.5$ 
$n$step period  $128$ 
Number of epochs  $4$ 
Number of minibatches  $4$ 
Extrinsic advantage coefficient  2.0 
Intrinsic advantage coefficient  1.0 
Extrinsic discount factor ${\gamma}_{e}$  $0.999$ 
Intrinsic discount factor ${\gamma}_{i}$  $0.99$ 
PPO clip range  $[0.9,1.1]$ 
Gradient norm clipping  $0.5$ 
Reward prediction coefficient  $1$ 
Value replay coefficient  $1$ 
Pixel Control coefficient  $1$ 
Replay length per actor  $2000$ 
The UNREALA2C2 agent shows increased sample efficiency over the original implementation [45, Figure 6] for MontezumaRevenge achieving the score of 500 at around 25 million steps as opposed to 50 million in [45, Figure 6]. However, the UNREALA2C2 agent’s performance plateaus at this score whilst the original implementation increases approximately linearly. The difference between these two agents could due to one of many changes. As using GAE is shown to generally increase performance [22], the suspect list is narrowed down to the inclusion of the previous action and reward to the policy as well as the use of a LSTM network.
A clear parallel between these two methods can be seen when contrasting them from a
modelbased/modelfree viewpoint. Both RND and UNREAL agents can be said to utilise environment models to increase performance, as UNREAL agents learn a reward model $p({r}_{t}{s}_{t})$, unconditioned on the action i.e.
$$p({r}_{t}{s}_{t})=\sum _{a\in \mathcal{A}}p(a{s}_{t})({r}_{t}{s}_{t},a)=\sum _{a\in \mathcal{A}}p(a{s}_{t}){\mathcal{R}}_{s}^{a}$$  (7.1) 
and RND agents learn a random feature model of the environment which can be said to be learning a simplistic density model of regions explored by the agent. This interplay between modelbased and modelfree reinforcement learning, seems to show increased sample efficiency and performance over purely modelfree agents, again this is empirically supported by [56] as well as in Figure 7.3. This theory is also backed by biological plausibility as modelbased learning is said to be ‘ubiquitous’ in animal brains [57].
Qualitative analysis of trained agents show that all sparse reward solutions explore the same 3 rooms in MontezumaRevenge, and the difference in the score is that the RND and RANDAL agents learn to use a collected sword to kill any enemy resulting in a large score increase. An interesting observation was that after the agents had reached the maximum total score, that they were capable of achieving, the RND agent aimlessly wanders the surrounding rooms and the UNREALA2C2 and RANDAL agents spend the remainder of the episode transitioning between two rooms. These can be explained as the RND agent is trying to maximise the intrinsic reward for the rest of the episode and the UNREALA2C2 agent being ‘interested’ in the rooms transitions as they provide large pixel changes. The RANDAL agent shows a mix between these two behaviours, first moving back to the first room then for the remainder of the episode, it transitions between two rooms.
7.3 Classic Control Experiments
For the Classic Control Experiments the figures show the mean score $\pm $ the standard deviation for 5 different random starts.
7.3.1 Adam Hyperparameter Search
Initial results from the Classic Control environments using hyperparameters from Table 7.1 and Table 7.2 and Adam optimisation, showed relatively poor performance, so a small hyperparameter search was completed. For the A2C algorithm a random search was performed for 50 combinations of the entropy and learning rate with different random starts. The hyperparameters were sampled from a LogUniform(${10}^{3},1$) and LogUniform(${10}^{5},{10}^{2}$) for the entropy and learning rate respectively.
The learning rates in Figure 7.4 have a larger affect on performance than the entropy coefficient, this is more pronounced in the Acrobot results where a large range of entropy values provide good performance. However even large entropy regularisation wasn’t enough to for any the agents sampled to achieve any score in the sparse reward MountainCar task.
For the Classic Control SyncDDQN agents, the exploration rate was linearly annealed over 80,000 steps, as to be proportional to the Atari experiments i.e. linearly annealed over $1/25th$ of training. These were done to try to keep the hyperparameters consistent or proportional across all tasks as to reduce the number of variables that could be affecting the results. The agents are validated using a ${\epsilon}_{test}$ of value 0.01 for all SyncDDQN agents. The learning rate is sampled from a LogUniform($5\times {10}^{5},{10}^{2}$) distribution and the final $\epsilon $ is chosen to be either 0.1 or 0.01 with equal probability.
7.3.2 Adam vs RMSProp
For the A2C agent the learning rate and entropy coefficient were selected from the results shown in Figure 7.4, to perform well over both the CartPole and Acrobot environments. These values were selected to be $2\times {10}^{4}$ and $0.005$ for the learning rate and entropy coefficient respectively.
For the SyncDDQN agent the only values shared that perform well in Figure 7.5 for both CartPole and Acrobot. This is where the learning rate of $2\times {10}^{4}$ and final exploration value $\epsilon =0.1$. As mentioned In Figure 7.5, the annealing the learning rate over 80,000 steps does not seem to provide enough exploration, so the number of steps were increased, so that the $\epsilon $ was annealed from 1 to 0.1 over half of training i.e. 1 million steps.
The initial optimised Adam results, showed large dips in performance around the 500,000 step mark for both CartPole and Acrobot experiments, where the gradients have significantly overshot a local maximum, causing performance to drastically decrease. This was expected to be due to the momentum from the Adam optimiser causing the weights to be ‘slingshot’ further from the global maxima. To test this, the RMSProp optimiser which has no momentum was used to compare the results, for the RMSProp agents the learning rates and entropy coefficient hyperparameters taken from the literature, as seen in Table 7.1 and Table 7.2, and the $\u03f5=1\times {10}^{5},{\beta}_{v}=0.9$


Unlike the in Atari Freeway environment as seen in Figure 7.1, both algorithms are unable to successfully navigate the MountainCar environment in Figure 7.6. Even with sufficient exploration for the SyncDDQN agents the sparsity of the reward provides it very difficult for the agent to connect the series of actions to the reward received when reaching the target.
7.3.3 $\lambda $Return and PPO
For the SyncDDQN($\lambda $) and A2CGAE agents all hyperparameters were kept constant as increasing the step size decreased sample efficiency and performance, due to $n$=20 being much closer to the length of the episode ($200$ steps) in the Classic Control environments, compared to the Atari experiments. For simplicity and fairness, the PPO agents share many of the same hyperparameters of the A2C agents, and can be seen in Table 7.7
Hyperparameter  Value 

Optimiser  RMSProp 
Learning rate  $1\times {10}^{3}$ 
Number of actors  $32$ 
Entropy coefficient  $0.01$ 
Value coefficient  $0.5$ 
$n$step period  $5$ 
Number of epochs  $4$ 
Number of minibatches  $1$ 
Discount factor $\gamma $  $0.99$ 
PPO clip range  $[0.9,1.1]$ 
Gradient norm clipping  $0.5$ 
The results from Figure 7.7 can be explained by the ${R}^{\lambda}$ and GAE, being more complex functions to model, thus slightly decreasing the learning speed of the agents.
7.3.4 Sparse Reward Solutions
Due to the nature of the pixel control task, the UNREALA2C and RANDAL agents only use the value replay and reward prediction tasks. All PPObased agents (RANDAL, RND, ICM) use the same base policy hyperparameters as the Classic Control PPO agents seen in Table 7.7, with the ${\gamma}_{e}$ and ${\gamma}_{i}$ and other algorithm specific hyperparameters from Tables 7.4 and 7.6.
Surprisingly Figure 7.8, shows that auxiliary reward prediction and curiosity driven exploration are not enough for agents to sufficiently explore the MountainCar environment, this could be due to the normalisation techniques in curiosity agents or the intrinsic rewards are not enough to incentivise the agent to explore different methods of hill climbing before the parameters get stuck in a suboptimal stationary point.
The CartPole and Acrobot environments are clearly limiting benchmarks, easily solved by traditional agents, thus providing little information on the effectiveness on more complex algorithms. The MountainCar environment provides a difficult benchmark to reach for neural network parametrised agents. This is as neural networks often work better on higher dimensional data and require large amounts of data to converge compared to other models. The frequency of the reward is too small for these agents to sufficiently learn, another explanation is that the neural network agents are being stuck in a steep local stationary point where even after experiencing positive reward the update is not large enough to escape this point. One way to solve this problem would be to use a form of prioritised experience replay with large initial random exploration, to ensure the agent experiences enough rewards earlier on on the training.
Chapter 8 Conclusions and Further Work
Large performance gains in deep learning have often been attributed to increased computational power, however fundamental difficulties of sparse reward reinforcement learning have not been solved brute strength so far. Instead it has often relied on more intelligent exploration and exploitation algorithms, inspired by solutions found in nature, as reinforcement learning has had significant input from the fields of neuroscience and psychology.
This report has shown that the combination and interplay of modelbased and modelfree reinforcement learning techniques in nontraditional ways, often increases sample efficiency and performance. This report has also shown that specifically finding reinforcement learning solutions that work well across multiple sparse reward tasks, generally increases performance across all tasks.
8.1 Further Work
With the increased in performance of the UNREALA2C2, RND and novel RANDAL agents in sparse tasks, further work could be done to combine different aspects of these algorithms. For the UNREALA2C2 further investigation could be done to determine the compare the differences between this implementation vs the original, such as the use of GAE, Adam vs RMSprop optimisers, and the impact the feeding temporal information such as previous actions and rewards.
For the RANDAL agents a hyperparameter search could be done to optimise the weighting of the auxiliary tasks with the intrinsic reward. Combining intrinsic and extrinsic rewards for the reward prediction task is also an avenue for further exploration for the RANDAL architecture.
Following the successes of the interplay between modelbased and modelfree learning a proposed direction is to combine modelbased imagination used in [56] with intrinsic motivation and auxiliary tasks.
Bibliography
 [1] R. A. Wise and P.P. Rompre, “Brain dopamine and reward,” Annual Review of Psychology, vol. 40, no. 1, pp. 191–225, 1989. PMID: 2648975.
 [2] C. Sammut and G. I. Webb, eds., Samuel’s Checkers Player, pp. 881–881. Boston, MA: Springer US, 2010.
 [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
 [4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” CoRR, vol. abs/1207.4708, 2012.
 [5] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al., “Starcraft ii: A new challenge for reinforcement learning,” arXiv preprint arXiv:1708.04782, 2017.
 [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” CoRR, vol. abs/1312.5602, 2013.
 [7] R. Dubey, P. Agrawal, D. Pathak, T. L. Griffiths, and A. A. Efros, “Investigating human priors for playing video games,” in ICML, 2018.
 [8] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiositydriven exploration by selfsupervised prediction,” in ICML, 2017.
 [9] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.
 [10] P. A. Gagniuc, Markov chains: from theory to implementation and experimentation. John Wiley & Sons, 2017.
 [11] C. C. White, Markov decision processesMarkov decision processes, pp. 484–486. Boston, MA: Springer US, 2001.
 [12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, second ed., 2018.
 [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [14] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” CoRR, vol. abs/1810.12894, 2018.
 [15] Y. Burda, H. Edwards, D. Pathak, A. J. Storkey, T. Darrell, and A. A. Efros, “Largescale study of curiositydriven learning,” CoRR, vol. abs/1808.04355, 2018.
 [16] D. Silver, “Lecture 2: Markov decision process.” http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf.
 [17] R. Bellman et al., “The theory of dynamic programming,” Bulletin of the American Mathematical Society, vol. 60, no. 6, pp. 503–515, 1954.
 [18] E. Vasilaki, “Lecture notes in adaptive intelligence,” Department of Computer Science, University of Sheffield, 2019.
 [19] D. Silver, “Lecture 4: Model free prediction.” http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MCTD.pdf.
 [20] C. J. C. H. Watkins and P. Dayan, “Qlearning,” Machine Learning, vol. 8, pp. 279–292, May 1992.
 [21] H. V. Hasselt, “Double qlearning,” in Advances in Neural Information Processing Systems 23 (J. D. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta, eds.), pp. 2613–2621, Curran Associates, Inc., 2010.
 [22] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “Highdimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
 [23] R. Morris, “D.o. hebb: The organization of behavior, wiley: New york; 1949,” Brain Research Bulletin, vol. 50, no. 5, p. 437, 1999.
 [24] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., “Learning representations by backpropagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1, 1988.
 [25] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991.
 [26] Y. LeCun, “A theoretical framework for backpropagation,” in Proceedings of the 1988 Connectionist Models Summer School (D. Touretzky, G. Hinton, and T. Sejnowski, eds.), (CMU, Pittsburgh, Pa), pp. 21–28, Morgan Kaufmann, 1988.
 [27] W. Hossack, “Fourier transform,” 2019.
 [28] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, “Chapter 10  deep learning,” in Data Mining (Fourth Edition) (I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, eds.), pp. 417 – 466, Morgan Kaufmann, fourth edition ed., 2017.
 [29] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323, 2011.
 [30] C. E. Shannon, “A mathematical theory of communication,” Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
 [31] J. Shore and R. Johnson, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum crossentropy,” IEEE Transactions on Information Theory, vol. 26, pp. 26–37, January 1980.
 [32] J. Shore and R. Johnson, “Properties of crossentropy minimization,” IEEE Transactions on Information Theory, vol. 27, no. 4, pp. 472–482, 1981.
 [33] C. D. Sa, “Nonconvex optimization.” http://www.cs.cornell.edu/courses/cs6787/2017fa/Lecture7.pdf.
 [34] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., “Learning representations by backpropagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1, 1988.
 [35] G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learning lecture 6a overview of minibatch gradient descent,” Cited on, vol. 14, p. 8, 2012.
 [36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [37] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016.
 [38] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, “IMPALA: scalable distributed deeprl with importance weighted actorlearner architectures,” CoRR, vol. abs/1802.01561, 2018.
 [39] Y. Wu, E. Mansimov, S. Liao, R. B. Grosse, and J. Ba, “Scalable trustregion method for deep reinforcement learning using kroneckerfactored approximation,” CoRR, vol. abs/1708.05144, 2017.
 [40] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “Openai baselines.” https://github.com/openai/baselines, 2017.
 [41] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double qlearning,” CoRR, vol. abs/1509.06461, 2015.
 [42] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015.
 [43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
 [44] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” CoRR, vol. abs/1710.02298, 2017.
 [45] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” CoRR, vol. abs/1611.05397, 2016.
 [46] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying countbased exploration and intrinsic motivation,” CoRR, vol. abs/1606.01868, 2016.
 [47] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” CoRR, vol. abs/1707.01495, 2017.
 [48] “Cartpolev1 documentation.” https://gym.openai.com/envs/CartPolev1. Accessed: 20190825.
 [49] “Acrobotv1 documentation.” https://gym.openai.com/envs/Acrobotv1. Accessed: 20190825.
 [50] “Mountaincarv0 documentation.” https://gym.openai.com/envs/MountainCarv0. Accessed: 20190825.
 [51] S. Laue, M. Mitterreiter, and J. Giesen, “Computing higher order derivatives of matrix and tensor expressions,” in Advances in Neural Information Processing Systems, pp. 2750–2759, 2018.
 [52] Z. Wang, N. de Freitas, and M. Lanctot, “Dueling network architectures for deep reinforcement learning,” CoRR, vol. abs/1511.06581, 2015.
 [53] S. Aksoy and R. M. Haralick, “Feature normalization and likelihoodbased similarity measures for image retrieval,” Pattern Recognition Letters, vol. 22, no. 5, pp. 563 – 582, 2001. Image/Video Indexing and Retrieval.
 [54] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Random network distillation.” https://github.com/openai/randomnetworkdistillation, 2019.
 [55] “Clip by normalised gradient.” https://www.tensorflow.org/api_docs/python/tf/clip_by_norm, 2019.
 [56] G. Kalweit and J. Boedecker, “Uncertaintydriven imagination for continuous deep reinforcement learning,” in Proceedings of the 1st Annual Conference on Robot Learning (S. Levine, V. Vanhoucke, and K. Goldberg, eds.), vol. 78 of Proceedings of Machine Learning Research, pp. 195–206, PMLR, 13–15 Nov 2017.
 [57] “The ubiquity of modelbased reinforcement learning,” Current Opinion in Neurobiology, vol. 22, no. 6, pp. 1075 – 1081, 2012. Decision making.