Dealing with Sparse Rewards in Reinforcement Learning

  • 2019-11-11 14:18:31
  • Joshua Hare
  • 0


Successfully navigating a complex environment to obtain a desired outcome isa difficult task, that up to recently was believed to be capable only byhumans. This perception has been broken down over time, especially with theintroduction of deep reinforcement learning, which has greatly increased thedifficulty of tasks that can be automated. However, for traditionalreinforcement learning agents this requires an environment to be able toprovide frequent extrinsic rewards, which are not known or accessible for manyreal-world environments. This project aims to explore and contrast existingreinforcement learning solutions that circumnavigate the difficulties of anenvironment that provide sparse rewards. Different reinforcement solutions willbe implemented over a several video game environments with varying difficultyand varying frequency of rewards, as to properly investigate the applicabilityof these solutions. This project introduces a novel reinforcement learningsolution by combining aspects of two existing state of the art sparse rewardsolutions, curiosity driven exploration and unsupervised auxiliary tasks.


Quick Read (beta)

University of Sheffield

[1.5cm] Dealing with Sparse Rewards in Reinforcement Learning


[1cm] Joshua Hare

[1cm] Supervisor: Eleni Vasilaki

[1cm] A report submitted in fulfilment of the requirements

for the degree of MSc in Advanced Computer Science

[0.3cm] in the

[0.3cm] Department of Computer Science

[2cm] September 11, 2019




All sentences or passages quoted in this report from other people’s work have been specifically acknowledged by clear cross-referencing to author, work and page(s). Any illustrations that are not the work of the author of this report have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this project and the degree examination as a whole.

Name: Joshua Hare

Signature: Joshua Hare

Date: 15/05/2019


Successfully navigating a complex environment to obtain a desired outcome is a difficult task, that up to recently was believed to be capable only by humans. This perception has been broken down over time, especially with the introduction of deep reinforcement learning, which has greatly increased the difficulty of tasks that can be automated. However, for traditional reinforcement learning agents this requires an environment to be able to provide frequent extrinsic rewards, which are not known or accessible for many real-world environments. This project aims to explore and contrast existing reinforcement learning solutions that circumnavigate the difficulties of an environment that provide sparse rewards. Different reinforcement solutions will be implemented over a several video game environments with varying difficulty and varying frequency of rewards, as to properly investigate the applicability of these solutions. This project introduces a novel reinforcement learning solution by combining aspects of two existing state of the art sparse reward solutions, curiosity driven exploration and unsupervised auxiliary tasks.


[title=Summary of Notation]



Chapter 1 Introduction

Reinforcement learning is a multidisciplinary field combining aspects from psychology, neuroscience, mathematics and computer science, where an agent learns to interact with a environment by taking actions and receiving rewards. This is inspired from observations of chemical reward signals found in brains of humans and many other animal [1], that dictate the behaviours of these animals. Video games have been a common benchmark for reinforcement learning (RL) throughout the existence of RL. From Samuel’s checkers player [2] in 1959 to the game of Go [3], Atari-2600 games [4], to most recently Deepmind’s Starcraft II [5], which has over 108 possible available actions that the agent can take. The preference for Video games in RL, is easily understandable, as they provide an enclosed simulatory environment, often with easily accessible, quantifiable rewards that may not be available or difficult to measure in a real world environment. As it is common for many RL algorithms to take millions of training examples to effectively learn to interact with its environment, video games are useful in that they are computational simulations. This means that the environment is able to run faster than real-time, and many simulations can be performed simultaneously.

However, many modern approaches to RL such as Deep Q learning [6] require, as stated previously, millions of examples in order to optimise due to some fundamental difficulties of the task. One of these is that often RL agents have no prior knowledge of the environment in this case video games, as often video games are modelled on and include real world objects such as ladders, and physics such as gravity. From [7] it was shown that humans perform significantly worse when prior knowledge is masked, and similarly, reinforcement agents perform better when given prior information about the environment.

Another fundamental difficulty of RL is that of a sparse reward signal provided by an environment. A Sparse reward signal is series of rewards produced by the agent interacting with the environment, where most of the rewards received are non-positive. This makes it extremely difficult for reinforcement learning algorithms to connect a long series of actions to a distant future reward. For extremely sparse rewards, the agent may never find a reward signal, and thus will never learn how to perform a given task. Human agents are not hindered by sparse reward signals and in fact are capable of achieving tasks over a entire lifetime with little to no reward through intrinsic motivation [8], traditional RL algorithm’s learning is severely inhibited by sparse rewards.

This means that in order to automate a specific tasks with traditional reinforcement learning, the rewards received by the agent have to be carefully designed in order to maximise optimal behaviour. This is a problem as it requires that researchers and developers take time to fully understand the dynamics or optimal behaviour for an environment a priori. This takes time to implement, or worse the dynamics or optimal behaviour may unknown to the designers, so the rewards given to the agent may provide sub-optimal or non-functioning behaviour. By having an agent that can learn from sparse rewards, it allows designers to provide more abstract and long term goals, as well as being able tackle more complex tasks.

This project seeks to explore and contrast solutions to the inhibiting effect sparse rewards on RL agents, their implementations can be found on the following GitHub repository

1.1 Aims and Objectives

In this project the overall aims are:

  • Use reinforcement learning to successfully implement a working agent capable of playing video games

  • Explore and contrast existing reinforcement methods to deal with sparse reward signals

  • Implement existing solutions for sparse reward reinforcement learning on a series of increasing difficulty tasks

1.2 Overview of the Report

This report begins with a literature survey starting with chapter 2, which is a detailed introduction to reinforcement learning, specifically with the use of Markov Decision Processes. This chapter introduces some key terminology for RL and also introduces the mathematical framework used to evaluate and create RL algorithms as well the algorithms themselves. The literature survey continues on to introduce neural networks in chapter 3, finally a final chapter 4 on deep neural networks use in RL, including state of the art neural network implementations related to solving RL with sparse rewards.

Chapter 5 contains a brief description of the environments used to evaluate the agents implemented in this project. Following this is chapter 6 dedicated to a detailed explanation on the implementations of the RL agents tested in this project, as well as preliminary analysis on their expected performance. The results of the experiments performed are presented in chapter 7 and finally the conclusions of the project are found in chapter 8

Chapter 2 Reinforcement Learning Background

2.1 Markov Chains

A Markov chain is a probabilistic model that connects are series of states s in the set 𝒮 via transition probabilities, i.e. moving from one state to another [9]. A state is some representation of a model, this could be a known behaviour or variable, or an unobserved or partially observed hidden representation of a system (Hidden Markov Model). The transition probabilities between the states in the Markov Chain hold the Markov property, that is the transition between the next state of a system st+1 is only dependent on the current state st irrespective of the previous state st-1 [10].

p(st+1|s0,s1,,st)=p(st+1|st) (2.1)

2.2 Discrete Markov Decision Processes

A Discrete Markov Decision Process (MDP) is a decision making process characterised by a discrete state and time Markov chain. In a MDP the set of states 𝒮 are a representation of the environment e.g. game piece configuration on a board game [11]. In an MDP at each time step a decision is made by taking an action at from a set of actions available in state s, at𝒜s. Taking the action at produces a state transition modelled by the probability distribution 𝒫ssa=P(st+1=s|st=s,at=a) [12], following each state transition a real reward is received rt 11 1 The reward after transitioning from state st to st+1 is sometimes denoted as rt+1 [12]. The reasoning for the choice of rt is to be consistent with the modern deep reinforcement learning papers [6],[13],[14],[15], which use this notation. characterised by the reward model rt=sa=(rt|st=s,at=a).

2.3 Reinforcement Learning in Markov Decision Processes

Reinforcement learning is an optimisation process in which an agent learns a policy π (a strategy in which determines what decisions will be made), by interacting with an environment such that it maximises the expected future reward. For reinforcement learning in an MDP the optimisation process is to learn the optimal policy ππ* which maximises the cumulative expected future reward [12], known as the return22 2 Also referred to as Gt in [12] R. The return at time t, Rt is the expected future reward from state st onwards indefinitely for continuous tasks or until the final state sT for episodic tasks.

Rt=k=0rt+k (2.2)

A policy π maps the state s onto a probability distribution over the available actions in 𝒜s [12]. The probability for each action a𝒜s is denoted by the term π(a|s). Value functions provide an estimation of the value of being in state s or the value of taking an action a in s under a certain policy π. Commonly used in reinforcement learning are the value functions Vπ and Qπ, which denote the value and state-action value functions respectively.

The value function for a given state s under π is given by the expected future reward of from starting at s and following the policy π until the terminal state. Often a discount factor γ[0,1] is used dampen the reward signal from distant future actions so that the agent prioritises more immediate rewards (with the idea that immediate rewards are better than distant ones) and to deal with any uncertainty in the model [16]. Similarly the state-action value function is given by the expected future reward starting from s taking action a then following π onwards. The value function and state-action value function are given by:

Vπ(s) =𝔼π[Rt|st=s]=𝔼π[k=0γkrt+k|st=s] (2.3)
Qπ(s,a) =𝔼π[Rt|st=s,at=a]=𝔼π[k=0γkrt+k|st=s,at=a] (2.4)

In order to solve reinforcement learning, the optimal policy π* is to be found, which is to say to find the strategy which yields the highest reward over all the states. A policy π is considered better than another policy π if it’s value function is greater over all states [12], Vπ(s)>Vπ(s)sS. Therefore, this can be extended to define the optimal policy:

π*=argmaxπVπ(s)sS (2.5)

Since V* is the maximum expected reward from starting from state s then it must be the maximum of the expected reward over all possible available actions [12] V*(s)=maxa𝒜sQ*(s,a). Hence, the optimal policy can be written in terms of the optimal state-action value function:

π*=argmaxaQ*(s,a)sS (2.6)

2.4 Dynamic Programming

Dynamic programming is technique developed by Bellman to recursively solve a structural search problem (in this case a the optimal policy for a MDP), by storing the values of computed variables that reoccur during the recursive process [17]. From equations (2.3) & (2.4) it can be seen that the cumulative expected reward Rt=k=0rt+k is calculated at each time step t, which can be computationally demanding for large MDPs.
Using γ0=1 (2.3) can be expanded [18]:

Vπ(s) =𝔼π[rt+γrt+1+γ2rt+2+γKrt+K|st=s]Using Definition of Rt 2.2
=𝔼π[rt+γRt+1|st=s] (2.7)
conditioning on the next state s using law of total expectation
then condition on the action a
then using addition theorem of expectations
using the Markov property and definition of Reward model

Thus the value function can be expressed recursively with the reward and transition model:

Vπ(s)=aπ(a|s)s𝒮𝒫ssa[sa+γVπ(s)] (2.8)

A similar derivation can be done conditioning on the next state s and next action a or from (2.8) using:

Qπ(s,a)=sS𝒫ssa[sa+γa𝒜π(a|s)Qπ(s,a)] (2.9)

These are known as the Bellman equations for stochastic processes.

Dynamic program requires a known MDP in order to estimate the optimal value function hence update the policy. The policy can be updated iteratively by computing the value function of a policy π as according to the Bellman equations given above, then updating the policy using (2.5) or (2.6) depending on the value function used.


[title = Greedy Policy Iteration for State-Action Value Function] {algorithmic} \StateInitialise Q(s,a),π(s) \Repeat
1. Policy Evaluation \Repeat\Foreach s𝒮 \Foreach a𝒜 \StateQoldπ=Qπ \StateQπ(s,a)sS𝒫ssa[sa+γa𝒜π(a|s)Qπ(s,a)] \EndFor\EndFor\UntilQoldπ-Qπ<ϑ \CommentCheck For Convergence

2. Policy Update \Foreach s𝒮 \Stateπold(s)π(s) \Stateπ(s)=argmaxaQπ(s,a) \EndFor\Untilππold \CommentFound stable policy

Figure 2.1: Greedy policy iteration for state-action value function algorithm, adapted from [12]

2.5 Model Free Reinforcement Learning

Rather than learning the optimal policy for a known MDP structure, model-free reinforcement learning attempts to find the value function of being in state s directly from experience. This is a crucial step for applying reinforcement learning to real world problems, as most environments have an unknown MDP or too large state-action space to computationally model. The cost of this approach is the agent has a less efficient learning process, i.e. many more samples required to optimally converge.

2.5.1 Monte Carlo

In Monte Carlo reinforcement learning is a model free method which learns the value function for episodic tasks. An episodic task is a sequence of sequential experiences st,at,rt, that always have a terminal state i.e. from s0,,sT. Most games can be defined as episodic tasks, an example being a game of chess always has a terminal state (a final board-piece configuration) after the game has finished through a win or draw. The reason Monte Carlo reinforcement learning can only learn from episodic tasks is that it approximates the expectation of future rewards in the value functions (2.8) and (2.9) with an average of Rt for all states visited, over each episode. The future reward Rt can be determined online with dynamic programming however, the average state count can only be determined at the end of the episode. The average will tend to the true expectation as the number of samples approaches infinity due to the central limit theorem. Two common ways the average is calculated in Monte Carlo methods are the first visit evaluation and the every visit evaluation. In the first visit approach, the value function for a state s is calculated only once per sample (episode) and is averaged over all samples. The every visit approach averages the value of the state for every visit including multiple visits per episode [12]. The Value function can be expressed via V(s)=S(s)/N(s), where S(s) is the sum of sample value estimates from state s and N(s) is the number of times the agent has visited state s.
The value function can be updated iteratively in Monte Carlo learning, as the mean of a series can be computed iteratively. For a dynamic problem it may be useful to calculate the running mean whilst forgetting older episodes, this is by introducing a dynamic variable α [12]:

V(st)V(St)+α(Rt-V(St)) (2.10)

A disadvantage Monte Carlo method introduces is the inability to learn continuous tasks and the inability to update the value function before the end of an episode. This is not an issue for short time length episodes, but can greatly increase the time taken to find the optimal policy.

2.5.2 Temporal Difference

Temporal Difference combines the model free learning from the Monte Carlo method with the bootstrapping (expected future reward value estimation) from Dynamic Programming. This allows policy learning during episodes or continuous tasks, without the need to model the transition probabilities 𝒫ssa. In Temporal Difference learning, the optimal value function is found by iteratively updating the value function with the temporal error between the value predicted at V(st) and the future reward Rt=kγkrt+k similar to Monte Carlo evaluation (2.10). This is done by sampling from the expected future reward from (2.7) and using current value function estimate instead of the true value function, as seen in Dynamic Programming [12]. The temporal error is then:

δt=rt+γV(st+1)-Vst) (2.11)

V(s) can be iteratively updated using the temporal error δt, this is known as TD(0):

V(s)V(s)+α[rt+γV(st+1)-V(st)] (2.12)

Using the state-action value function in (2.12) results in the on-policy method commonly known as SARSA (state, action, reward, state, action)

Q(s,a)Q(s,a)+α[rt+γQ(st+1,at+1)-Q(st,at)] (2.13)

A bridge between Temporal Difference and Monte Carlo methods can be extended first with an n-step Temporal Difference error TD(n) which is defined as the temporal error between state V(st) and V(st+n) defined using the n-step future return Rt(n) [12]:

Rt(n)=rt+γrt+1++γn-1rt+n-1+γnV(st+n) (2.14)
δt(n)=Rt(n)-V(st) (2.15)

The optimal number of steps n depends on the length of the task and the number of states, finding a general solution to this problem is done by further extending the bridge to Monte Carlo is the TD(λ) approach. This is where all the expected rewards (2.14) are computed from n=0,T and combined geometrically via [19]:

Rtλ=(1-λ)n=1λn-1Rt(n) (2.16)

The λ approach can also be applied to SARSA, SARSA(λ).

Q Learning(an off-policy version of SARSA) was developed by Watkins [20]. Here instead of approximating the true state-action value function for a given policy Qπ(s,a) with the current estimate, the optimal state-action value function is approximated Q*(s,a).

Q(s,a)Q(s,a)+α[rt+γmaxaQ(st+1,a)-Q(st,at)] (2.17)

Q Learning suffers from a maximisation bias, that is that it overestimates the values in the state-action function Q(s,a) i.e. with a positive bias. A solution was proposed in [21] to use two state-action value functions, one to predict the next action to be taken and another to predict the value estimation of the action. (2.17) then becomes:

QA(s,a)QA(s,a)+α[rt+γQB(s,argmaxaQA(s,a))-QA(s,a)] (2.18)

Both QA(s,a) and QB(s,a) are used as the to select the action at time t and at each optimisation step either A or B are updated via random selection.

Q Learning also suffers from poor exploration as the approximation to optimal value is being used as the policy hence often limiting the exploration to states of known rewards. A solution to this is to use a decaying ε - greedy policy to enable initial exploration, then use optimal policy once the states have been sufficiently explored.

ε-greedy{at=randoma𝒜with probabilityεat=maxaQ*(s,a)with probability  1-ε (2.19)

2.5.3 Policy Gradient

Rather than learning the optimal policy through the maximisation over value function approximations 2.5 or 2.6 (which can be expensive for high dimensional state-actions spaces), the policy can be learnt directly under the condition that the policy is parametrised by a differentiable function. The policy is learnt by calculating the gradient of the current policy estimate π(a|s,θ) [12]. This allows us to learn, stochastic policies, continuous action policies and should theoretically provide more stable convergence properties [16] over value-based methods such as Q-Learning. However learning the policy directly can reduce the sample efficiency of the agent taking longer to converge to the optimal policy.
The policy gradient theorem shows that maximising the gradient of the value function is equivalent to maximising the gradient of the policy multiplied by the state-action value function all under the expectation of the policy [12]. This gives the gradient objective function J(θ) (with the objective of maximising the expected future reward) as

θJ(θ)=𝔼πθ[Qπ(s,a)θlogπ(a|s,θ)] (2.20)

Using the definition of Qπ(s,a) 2.4

θJ(θ)=𝔼πθ[Rtθlogπ(a|s,θ)] (2.21)

As Qπ(s,a) is not dependent on θ a baseline metric b(st) can be subtracted from equations 2.20 & 2.21 as long as it is constant w.r.t. a.

θJ(θ)=𝔼πθ[(Rt-b(s))θlogπ(a|s,θ)] (2.22)
Actor Critic

The Actor-Critic family of policy gradient methods involves combining the baseline policy gradient method with the bootstrapping method from temporal difference. For the Advantage Actor-Critic the baseline function used is the Advantage function

A(st,at)=Q(st,at)-V(st)=rt+γV(st+1)-V(st). (2.23)

Expressing the Advantage function in terms of only Vπ allows for the use of a single set of parameters to define the advantage function and reducing the dimensionality of the function being learnt. This has the benefits of being easier to learn as well as decreasing computational complexity, especially for large |𝒜| [22].

θJ(θ) =𝔼πθ[(rt+γV(st+1)-V(st))θlogπ(at|st,θ)]  where atπ (2.24)

By using V(s) as a baseline as well as to predict the TD error, value function provides the estimate of the quality of the action taken by the policy. Hence, it is said to be acting as a critic to the policy (the actor), thus giving the name Actor-Critic.

Generalised Advantage Estimation

Just as the TD n-step returns estimator can be combined in TD(λ), the advantage function estimation can be calculated similarly. Following 2.23, the n-step advantage estimator can be defined as

Atn=l=0n-1γlδt+l=l=0n-1γlrt+l+γnV(st+n)-V(st) (2.25)
Following the derivation from [22], Atλ can be derived by first exponentially averaging each n-step advantage estimate
Atλ =(1-λ)(At(1)+λAt(2)+λ2At(3)+)
Atλ =(1-λ)(δt+λ(δt+γδt+1)+λ2(δt+γδt+1+γ2δt+2)+)
Factoring by δt
Atλ =(1-λ)(δt(λ+λ2+)+γδt+1(λ2+λ3+)+)
Using the geometric series solution  k=0n-1ark=a(1-rn1-r)
Atλ =(1-λ)(δt(11-λ)+γδt+1(λ1-λ)+γ2δt+2(λ21-λ)+)


Atλ=l=0(λγ)lδt+l (2.27)

Chapter 3 Neural Networks

Artificial neural networks are neurologically inspired computational frameworks, which simulate neuron activity as a weighted summation over a series of connected input neurons. There are many different neural networks topologies that are commonly used, but most consist of a series of connected groups of neurons known as layers, which are fed stimulatory input through an input layer. An output layer (most often the final layer of the network) provides an estimation of some target value that is attempted to being modelled by the network. Neural networks were first developed the in the 1940s [23], however have seen a large increase in usage in recent years, due to increased computational power of modern computers. This allows larger, hence more powerful networks to be trained within a reasonable time frame.

3.1 Multi-Layer-Perceptron

Inspired by the perceptron algorithm Hinton et al created a feed-forward multi-layer perceptron (MLP) [24]. MLPs consist of distinct layers of nodes, where each node’s value is a summation of the previous layer l-1 with the synaptic weights of the current layer wijl under a non-linear activation function g. MLPs can approximate any function with an infinitely sized hidden layer [25] and can be trained to minimise a given loss function via the backpropagation algorithm [26]. Non-linear activation functions are a key process of the MLP’s ability to approximate any function and they are required to be differentiable in order to update the weights via backpropagation. The value of a node of a hidden layer l is given by:

hil=g(jwijlhjl-1)=g(Wlhl-1)  where Wn×m (3.1)

A loss function is defined by the final output layer y^ and the target value y given by the model (weights) θ, (y^,y|θ). As the output layer is a composite function the weights can be updated by differentiating the weights w.r.t the loss function using the chain rule.

wijl=zilzilwijl=hilhilzilzilwijl (3.2)

where z=jwijlhjl-1. As the loss w.r.t the hl is dependent on the next layer hil can be expressed recursively:

wijl =(jzjl+1zjl+1hil)hilzjlzilwijl Sincezilwijl=hil-1andhilzjl=g(zil)zjl
=(jzjl+1zjl+1hil)g(zil)zjlhil-1 (3.3)

3.2 Convolutional Neural Networks

Convolutional neural networks (CNN) are a visual cortex inspired neural network similar to the MLP as described above. They differ by the replacing the linear summation by a convolutional operation across the input. The convolution operation can be physically interpreted as an area overlap between two functions [27], in the context of neural networks it can be loosely interpreted as comparing a signal (image in video games) to a set of learnt filters which have a feature representation of different components of a signal. Using a convolutional layer allows the network to learn spatially or temporally invariant features across an input signal as the same filter is used across different sections of the input. In the context of images, this is equivalent to learning specific features, for example the features of a car regardless of the location of the car within the input. The convolution operation is defined by,

f(t)*g(t)=-f(τ)g(t-τ)𝑑τ For continuous functions (3.4)
x[m]*y[m]=n=-x[n]y[m-n] For discrete functions

The hidden layer component hijl for a 3D fully convolutional layer with a single filter is given by,

hijl=g(wijkl*hijl-1)=g(xyzwx,y,zlhi-x,j-y,k-zl-1) for a singular k×k×c filter wijkl. (3.5)

3.3 Recurrent Neural Networks

CNN and MLP are both examples of feed-forward neural network architectures. Specifically, these are both discrete acyclic graphs, meaning that each node in the graph is not connected to itself via any path. This reduces the capability of these networks for modelling sequential data, when the features xn are fed into the network at each time step t. This is because, in acyclic graphs there in no modelled relation between each pass of the network. Recurrent Neural Networks (RNN), solve this problem by combining the hidden state ht of the feature xt, with the hidden state ht-1 of the previous feature xt-1. The hidden layer at time t for a single-hidden layer RNN is given by the equation,

ht=g(Wxxt+Whht-1)  where Wxm×nWhm×m. (3.6)

The output yt is defined by an additional MLP layer, and can be produced every time-step in a many-to-many configuration or at the final time-step T in a many-to-one configuration. Another configuration includes that of one-to-many where the output yt-1 is fed into 3.6 instead of the previous hidden state ht-1.

As the hidden state ht hence output yt is conditioned of the previous step, the backpropagation algorithms flows backwards through the time steps Tt0, hence is known as backpropagation through time. However the drawbacks of this simple recurrent network is that of the vanishing gradients problem, where the gradients decrease in size from T to t0, hence simple RNNs can struggle to learn from long sequential data. This problem is reduced by the introduction of Long Term Short Memory (LSTM) recurrent networks which use multiple gates, in order to learn to extract more useful long-term important features.

3.4 Activation Functions

Many different non-linear activation functions are used for artificial neural networks and bring different strengths and weakness depending on the usecase. A brief description of each activation along with the relevant formula is given below.

The hyperbolic and logistic activation function families have been a popular choice throughout the history of neural networks. The two most common choices however are the tanh and sigmoid functions, these are given by the equations,

g(x)=tanh(x)=(ex-e-x)(ex+e-x) tanh (3.7)
g(x)=σ(x)=11+e-x sigmoid. (3.8)

The tanh and sigmoid function constrain the value of x to within the boundaries [0,1], [-1,1] respectively. This can lead to vanishing gradients [28] for deeper neural networks when training with backpropagation. This is because the gradient is restricted to between the values of [0,1], since backpropagation uses the chain rule, these gradients can quickly tend towards zero resulting in very small weight updates, as well as numerical error effects dominating the weight updates.

Rectified linear unit (ReLU) is perhaps the most popular activation function due to its simplicity, speed and often performance increase. It is given by the equation,

g(x)={0 for x0x for x>0  ReLU. (3.9)

The performance increase has often been attributed to the increase in sparsity of the connections that carry any non-zero value, this has been observed to be between 50%-80% in [29] for deep neural networks. Sparsity of neural connectivity is seen as a more biologically plausible model, as it is estimated that neurons in the human brain have a sparsity of between 90%-99% [29]. This intuitively makes sense as for random weight initialisation with mean zero should produce 50% sparsity, as the sets of positive and negative real numbers are equal in size. The vanishing gradient problem is negated as the gradient of the ReLU function is either 1 for positive x or zero otherwise

g(x)x={0 for x01 for x>0

, hence the gradients do not reduce in size over multiple layers. However the downside is that ReLU units can enter a value such that the neuron is never activated again, this is known as the dying ReLU problem and is solved by replacing the zero value in 3.9 by some function of x. For example it is replaced by αx in the leaky ReLU and a(ex-1) in the Exponential Linear Unit (ELU) functions.

g(x)={αx for x0x for x>0 leaky ReLU. (3.10)
g(x)={a(ex-1) for x0x for x>0 ELU. (3.11)

The softmax activation function produces a normalised probability distribution over each feature in an input vector xdzd, such that the probability of zi is is proportional to the exponent of the input value xi [9]. This is explicitly calculated by the formula,

g(xi)=exikdexk softmax. (3.12)

As the nature of this activation function this is generally used on the final layer of the network for multinomial logistic regression.

3.5 Loss Functions

In order to calculate the gradients a distance metric between the desired output yN and the actual output y^N must be defined. Mean squared error is a common distance metric which penalises the difference between two vectors by the mean of the squared distance between each feature

MSE=||y-y^||2=1Ni=0N(yi-yi^)2. (3.13)

A better comparison between two probability distributions than the mean squared error of the distributions is the cross entropy. Entropy is a measurement of uncertainty in a probability distribution. Entropy is highest when a distribution is uniform and lowest as the distribution approaches the Dirac delta function. For example the uncertainty is high for the outcome of a fair sided die roll, hence it has high entropy. For a discrete probability distribution entropy is given by,

H(x)=-ip(xi)logp(xi). (3.14)

Information theory states that a message contains high information when the outcome is unpredictable [30], as you have gained information about the outcome. Information is lowest when the outcome is known and predictable, as you already had the information about what the outcome would be. Entropy then can be seen as the amount of information in a signal. The Kullback-Leibler (KL) divergence can be seen as the amount of information needed to encode the information in the probability distribution p with another distribution q [31], it is lowest when q=p and hence can be seen as an error between these two probability distributions. Formally this is the expected log difference between the two distributions,

DKL(p||q)=𝔼x[logp-logq]=ip(xi)logp(xi)-p(xi)logq(xi)=ip(xi)log(p(xi)q(xi)). (3.15)

However this is not a true distance metric because it is not symmetrical DKL(p||q)DKL(q||p), nor does it satisfy the triangle equality [32].

The second term in equation 3.15, -ip(xi)logq(xi) is also known as the cross entropy between p and q. For a fixed p, for example the labels for a classification task, minimising the cross entropy is equivalent to minimising the KL-divergence. Minimising both the mean squared error and cross entropy are equivalent to maximising the likelihood of the data given the parameters [9].

3.6 Optimisation Methods

As backpropagation is a gradient based optimisation process, that minimises the parameters in some loss space L. Using gradient based optimisation allows the parameters to approach at least a local minimum for convex problems as well as stationary points for non-convex problems [33]. As neural networks require very large amount of data points compared to other Machine Learning algorithms, updating the gradient based on the average over all data points in a data set is impracticable. Thus, Stochastic Gradient Descent (SGD) methods are often used to increase training speed.

Using SGD methods is especially important as all data points cannot be accessed at any time, as it is being continually generated by the agent. SGD, however requires heavy tuning in order to find a optimal learning rate η for different tasks, such that the parameters do not get stuck in local minima and preferable one that converges with as few updates as needed. Because of this modified SGD algorithms are often used throughout machine learning, especially with neural networks.

Figure 3.1: Visual representation of stochastic gradient optimisers

3.6.1 Momentum based SGD

In order to reduce stochasticity of the SGD updates, the concept of momentum of gradient updates was first used in [34]. Here the analogy is that of a particle rolling down a hill, it experiences a force pulling on it which increases its momentum. When the gradients are steeper the particle accelerates, and as the hill flattens it decelerates. This stabilises the direction of the gradient updates as the momentum makes it less susceptible to large deviations caused by a single minibatch update. If the optimiser has a large enough momentum it can cause it to escape suboptimal local minima. However, too large momentum can cause the optimiser to overshoot minima thus being slower to optimally converge, as seen in Figure 3.1.

3.6.2 RMSProp

Root Mean Squared Propagation (RMSProp) [35], adapts the magnitude of the update for each parameter by dividing by a running average of the root mean squared gradient v=W¯2, this allows different features to be learnt at different scales, thus reducing the need to finely tune the learning rate. For a neural network layer weighting W, the update rule can be written as

vtβvt-1+(1-β)(JWt)2 (3.16)

Where β is the decay rate of the average.

3.6.3 Adam

Adaptive Moment Estimation (Adam) [36] combines the adaptive magnitude updates from RMSProp 3.16 with momentum based gradient descent [36], with bias correction for momentum and adaptive moment (mean square of the gradient)

mtβmmt-1+(1-βm)(JWt)  momentum (3.17)
vtβvvt-1+(1-βv)(JWt)2  adaptive moment
m^tmt/(1-βmt)  bias correction for momentum
v^tvt/(1-βvt)  bias correction for adaptive moment

Where βmt,βvt are the decay rates with exponent of the number of updates t, and ϵ is a small scalar value to prevent division by zero [36].

Chapter 4 Deep Reinforcement Learning

In video games it is common for the set of states 𝒮 to be given as a pixel array of the game screen sh×w×c, for a single colour frame of size 84×84 there are 842×(255)31×1011 different image combinations. Most video games will likely only use a tiny fraction of this image space, but it is still impractical to model the value functions for each state separately. A solution is to use a differentiable function to model V(s), this is known as value function approximation [12]. Deep Reinforcement Learning refers to using a deep neural network as a non-linear value function or policy approximator.

4.1 Traditional Deep Reinforcement Learning

4.1.1 Deep Q Learning

The Deep Q learning algorithm proposed by [6] is essentially the Q learning algorithm described previously (2.17), with two additional key features that stabilise the policy iterations for the use of neural networks. The two additional features are experience replay, and fixed target policy iteration. If using a gradient based iterative optimisation process as done via back-propagation in neural networks, it is important that the data is independent and identically distributed (i.i.d.). This is done as to avoid sampling bias from correlated inputs, which can cause the gradient to get stuck in a non-optimal local maxima (as we are performing gradient ascent to maximise the expected future reward). Experience replay is a method to help decorrelate the sequential experiences gained from dynamic programming and model free reinforcement learning methods. This is done by storing experiences, a tuple of st,at,rt,st+1 into a list of experiences known as the replay memory. Batch samples can be drawn randomly from the replay memory which provide i.i.d. for a large replay length. Deep Q Learning also solves the partially observable MDP problem by providing the input of the neural network with 4 concatenated previous states (greyscaled images for CNN).
The Deep Q Learning algorithm is expressed below :


[title = Deep Q Learning Algorithm with Experience Replay and ϵ - greedy Policy for Atari Environments] {algorithmic} \StateInitialize replay memory 𝒟 to capacity N \StateInitialize action-value function Q with random weights \Statek - number of frames to stack \Forepisode 1M \States1={x1,x1,}  stack initial frame k-times \Fortime tT \StateWith probability ε select a random action at \Stateotherwise at=maxaQ*(st,a|θ) \StateExecute action at in emulator and observe reward rt and image xt+1 \Statest+1={xt-k+1,,xt,xt+1}and add experience to 𝒟st,at,rt,st+1 \StateSample random minibatch of transitions sj,aj,rj,sj+1 from 𝒟 \StateSet yj={rj for terminal sj+1rj+γmaxaQ(sj+1,a|θ) for non-terminal sj+1
\StateθiLi(θi)=𝔼π[(r+γmaxaQ(s,a|θi-1)-Q(s,a|θi))θiQ(s,a|θi)|s,a,s] \StateEvery C steps θi-1=θi

Figure 4.1: Deep Q Learning algorithm, adapted from [6]

4.1.2 Advantage Actor Critic

Asynchronous Advantage Actor-Critic (A3C) [37] is an algorithm where multiple actors interact and learn from their own environment asynchronously. Each actor has it’s own local policy parameters which it uses to calculate the gradients of the policy and value function over n-steps, this is done using TD(n). The gradients calculated by each actor are asynchronously applied to shared global parameters, and after each n-step rollout the local parameters are updated from the global parameters. Running multiple environments allows the agent to learn from many uncorrelated experiences, thus reducing the impact on-policy training has on local optimal convergence. Thus, is seen as a replacement for an experience replay buffer. The greatest benefit of the A3C and other asynchronous method is the reduced training time over single actor algorithms, especially when using distributed computing [38].

A2C [39] is a synchronous version of A3C, where multiple actors interact synchronously using a single shared policy-network. Running these environments synchronously removes the need to correct for any off-policy correction and allows for greater GPU utilisation.


[title = n-step Synchronous Advantage Actor-Critic (A2C) Algorithm] {algorithmic} \StateRandomly initialize policy π and value function V with parameters θ, θv respectively \StateCreate Ne environments \Staten - number of steps to calculate TD over \StateRn - n-step return \Repeat\Foractor 1Ne \Fortime tt+n \StateSample actions atπ(a|st,θ) \StateExecute actions at and observe rewards rt and next states st+1 \EndFor\EndFor
\StateRtn={0 for terminal stV(st|θv) for non-terminal st   Bootstrap from last state 
\StateRtn={rt for terminal strt+γRt+1n for non-terminal st \EndFor\StateθL(θ)=𝔼π,eNe[θlogπ(a|s,θ)(Rn-V(s|θv)+βHθH(π(s,θ))]
\StateθvL(θv)=𝔼π,eNe[(Rn-V(s|θv))2θv] \Untiltmax

Figure 4.2: A2C algorithm adapted from [37] and [40]

4.1.3 Synchronous DDQN

In [37] the asynchronous multi-actor framework was also extended to include versions which use state-action value functions, e.g. SARSA and Q-Learning. Synchronous n-step Double DQN (Sync DDQN), is a synchronous version of the async DQN algorithm proposed in [37], with Double DQN parametrisation following [41]. The network used to provide the value of action was a fixed copy of the online network θ. The online network was used to select the action at under the epsilon greedy policy. The value used to bootstrap the estimated future return from 4.2 is then replaced by Q(st,argmaxaQ(st,a|θi)|θi-1), where θi-1 is a periodic copy of θ updated every C steps similar to 4.1.

4.1.4 Proximal Policy Optimisation

Trust Regions

In reinforcement learning the training data is dynamically produced by the agent. This means too large policy update can significantly alter the behaviour of the agent. Too large an update can cause the agent to move into parameter space where it converges on sub-optimal behaviour or worse, it no longer receives any rewards from the environment. A safer method to update the policy is to stick within a trusted region of the current policy as to not cause too dramatic updates. In Trust Region Policy Optimisation TPRO, [42] the surrogate objective function for a specific state, action pair st,at is the policy scaled by the policy of this pair w.r.t. the old policy parameters θold, penalised by the KL-divergence between the old policy and the new policy. This is given by the equation

maximize𝜃𝔼t[π(at|st,θ)π(at|st,θold)At-βKL(π(|st,θold)||π(|st,θ))][43]. (4.1)

β is a constant that controls the strength of the KL-divergence penalty. The original derived quantity leads to a too small an update [42], thus requires tuning for each task, or requires further complexity by adaptively changing the strength throughout training [43]. In [42] they overcome this by optimising the surrogate objective 𝔼t[π(at|st,θ)π(at|st,θold)At] under the constraint that the KL-divergence is less than some scalar value δ, instead of being penalised by it. This requires second order optimisation methods introducing even further complexity.


Proximal Policy Optimisation PPO [43], is a first order optimisation method that is used to calculate a lower-bound for the policy update. This is done in the form of clipping the policy ratio. The minimum value between the unclipped surrogate policy objective, and the clipped surrogate policy objective is chosen to perform the update on θ.

maximize𝜃𝔼t[min(π(at|st,θ)π(at|st,θold)At,clip(π(at|st,θ)π(at|st,θold), 1-ϵ, 1+ϵ)At)][43]. (4.2)

PPO is optimised over several epochs for each n-step sample, no theoretical justification is given in [43] for this. This is a problem for policy gradient methods as after the first epoch the updated parameters are longer the parameters that produced the current training data, and thus further epochs are off-policy updates. The use of multiple updates could be justified by claiming the parameter updates are small enough as they constrained by the old policy and the clipping bound, such that the new policy is never too far from the old policy. Thus, reducing the chances of sub-optimal convergence.


[title = Proximal Policy Optimisation (PPO) Algorithm] {algorithmic} \StateRandomly initialize policy π and value function V with parameters θ, θv respectively \StateCreate Ne environments \Staten - number of steps to calculate TD over \Stateθoldθ \Repeat\Foractor 1Ne \Fortime tt+n \StateSample actions atπ(a|st,θ) \StateExecute actions at and observe rewards rt and next states st+1 \EndFor\EndFor\StateCompute Advantages A and Returns R \Forepoch 1,,K \Forminibatch 1,,N \Statesample minibatch transitions randomly without replacement (sj,aj,Aj,Rj) \setstretch1.5 \Stater(θ)=π(aj|sj,θ)π(aj|sj,θold) \StateθL(θ)=𝔼[θmin(r(θ)Aj,clip(r(θ), 1-ϵ, 1+ϵ)Aj)+βHθH(π(s,θ))] \StateθvL(θv)=𝔼π[(Rn-V(s|θv))2θv] \EndFor\EndFor\Stateθoldθ \Untiltmax

Figure 4.3: Proximal Policy Optimisation algorithm adapted from [43]

As stated previously the Q Learning algorithm suffers from a maximisation bias and poor exploration. To improve on [6], methods such as TD(n) and Double Q Learning as previously explained can be implemented along with additional reinforcement learning methods such as Duelling Q Networks, A3C and prioritised experience replay. These provide significant gains on the initial DQN algorithm as shown in [37] and [44]. Although, even after combining all these deep reinforcement learning methods, deep reinforcement learning still suffers from sampling inefficiency requiring up to 50 million state samples (200 million frames) in order to to converge on an policy equivalent to human-level performance in Atari-2600 games. Furthermore, all of these improvements fail to achieve near human level performance in sparse extrinsic reward games such as MontezumaRevenge or PrivateEye. 200 million frames is equivalent to playing 38 days in real-time, playing at 60fps. As stated earlier this is extremely inefficient learning when contrasted against humans who are often able to perform well in both dense and sparse reward games after a single or few attempts.

4.2 Deep Reinforcement Solutions to Sparse Extrinsic Rewards

4.2.1 Curiosity Driven Exploration via Next-State Prediction

Curiosity driven exploration in deep reinforcement learning as seen in [8] attempts to solve the lack of frequent rewards by introducing a intrinsic reward function. The intrinsic reward function provides rewards when the agent experiences novel states. In order to avoid the agent from seeking states of high stochasticity which the agent cannot model e.g. leaves fluttering in a breeze [8], the agent learns an exploration model only for features that the agent can interact with. The intrinsic reward is calculated by intrinsic curiosity model (ICM) in [8]. The ICM consists of two models the inverse dynamics model which tries to predict the action at given an encoded version of states ϕ(st) and ϕ(st+1) and the forward model that tries to predict the next encoded state ϕ(st+1) using the actual action taken by the agent at and the encoded state ϕ(st). In [15] the intrinsic reward it is given by the mean squared error it=|ϕ^(st)-ϕ(st+1)|2. By jointly optimising the inverse and forward model, the agent gains no benefit of trying to model any part of the environment that isn’t affected by the action taken [8]. However, this introduces a problem as parts of the environment that are not immediately affected by the agent’s policy, may be important in the future [15].

4.2.2 Unsupervised Auxiliary Tasks

The UNsupervised REinforcement and Auxiliary Learning (UNREAL) [45] agent increases the sample efficiency of the A3C agent by extracting additional reward signals from the environment.

Figure 4.4: Overview of the UNREAL agent, image taken from [45] with authorial permission. This figure shows the auxiliary tasks (b) Pixel Control, (c) Reward Prediction and (d) Value Function Replay for the UNREAL agent.

One additional task is that of pixel control (PC), where an auxiliary policy Qaux is trained to maximise the pixel intensity across different sections of the image input over an n-step period as seen in Figure 4.4(b). The auxiliary state-action function is trained by minimising the mean-squared error of the between the predicted n-step pixel change given as

LQaux=𝔼[(Rtn+γnmaxaQaux(s,a|θi-1)-Qaux(s,a|θi))2][45]. (4.3)

The other auxiliary tasks are that of reward prediction (RP), where the agent tries to predict the reward received only for the next step given three preceding frames Figure 4.4(c), and finally the value function replay (VR) which further trains the value function V(st) from a replay buffer 4.4(d). All three auxiliary tasks are trained off-policy where input state sequences are sampled from a small experience replay buffer. As the sequences sampled have different starting points to the online rollouts, this allows the agent to learn new temporal relations between states [45]. Here the three auxiliary tasks use the agents’ existing neural network to jointly optimise the network for all tasks. The total loss function is then,


The auxiliary tasks do not directly affect the policy π, but rather shape the features in the encoder network which could then affect the agent’s behaviour. However, Unlike the curiosity driven exploration the pixel control task learns to maximise the expected change in pixels, this could possibly indirectly lead to the agent seeking states of high stochasticity if LQaux is too large, as it will bias the encoded features towards those that change the pixel intensity the most.

4.2.3 Random Network Distillation

The novelty of a state can be also measured by the number of counts the agent has been in that state, in [46] this is combined with intrinsic motivation by providing an additional reward signal which is inversely proportional to the count. This method was simplified in [14] via using a Random Network Distillation (RND). RND consists of a base policy πθ, as well as randomly initialised target and predictor networks f and f^ respectively. The target network is fixed after initialisation and the predictor network is trained to predict the output of the target network for the next state st+1. The intrinsic reward is defined as the mean squared error between the two outputs it=|f^(st+1)-f(st+1)|2 and the predictor network is trained by minimising this loss [14]. By using a lower dimensional representation this should reduce the effect of the agent seeking states of high stochasticity as mentioned earlier.

Minimising the distance between f and f^ implicitly learns the count of the number times a state has been visited, as the predictor network cannot predict the target network output with greater than random probability without first visiting the state. As the target network f output is deterministic i.e. constant, this means the output is learnable and the loss can approach at least a local minima through gradient descent. This is notably different to a forward dynamics model, where for certain non-stochastic environments such as simplistic physics based environments the next state can be predicted without necessarily having to have explicitly visited that state, as the forward dynamics model learns the true transition model 𝒫ssa. This is likened to the agent getting bored in [8], as the outcome becomes more predictable hence less novel.

4.2.4 Hindsight Experience Replay

Auxiliary Tasks and curiosity driven exploration both attempt to explore unknown states by maximising the statistical novelty of a state compared to states the agent previously experienced. This only will work if there is a variation amongst the states being explored, however in some tasks the states lack diversity hence the ICM and pixel control policies will not sufficiently be able to explore the state space [47]. This is especially crucial for large state spaces with very few rewards. The proposed solution uses every episode the agent performs to provide a reward even if the agent does not achieve the said goal. This is done by creating virtual goals which use the final the state the agent ended up in as the desired outcome. An example given in [47] is using a robotic arm to moving an object to a specific coordinate (x,y). For a large x,y plane randomly sampling a specific location would be statistically negligible, so after every episode the agent updates it’s value function by ”pretending” that the final (x,y) coordinate was the desired outcome. Hence, the agent learns to move the object to any specified coordinate.

Chapter 5 Evaluation and Testing

5.1 Environments

In order to test the efficacy of the algorithms discussed above, a suitable MDP environment must be used. Several environments were used to provide a variety of environment complexities and task difficulties, but crucially a difference in sparsity of the extrinsic reward received. The environments include that of classic control games {CartPole, Acrobot, MountainCar} and a small selection of Atari games {SpaceInvaders, Freeway, MontezumaRevenge} were chosen to meet the above requirements, as well as their popularity so that there were plenty of reference results in the literature. All environments were executed using OpenAI’s gym toolkit.

5.1.1 Classic Control

These simplistic environments represent the state of the environment with a vector sn, with each feature corresponding to a physical variable of the system that the agent needs to learn to control in order to gain rewards. The CartPole-v1 game consists of an unstable inverted pendulum (pole) attached to a cart and the goal is to keep the pole upright. The agent can move the cart left or right via applying a discrete force of ±1 along the horizontal axis of the cart. The game finishes when the pole moves greater than 15°either side of the upright position, moving ±2.4 units from the centre or after 500 steps. The agent receives a reward +1 for each step until the episode is finished [48].

The Acrobot-v1 environment consists of a double pendulum in which the goal is to try to swing the tip of the lower pendulum to a specific target height. The agent can move the pendulum by applying a torque of {+1, 0, -1} to the joining link. Every step the agent receives a negative reward other than the target state [49].

The final control task is MountainCar-v0, where the goal is to reach the top of the mountain from the bottom of the valley. However, the car does not have enough power to directly drive up the side, therefore in order to solve the problem, the agent must gain momentum from oscillating either up slide of the mountain. The agent receives a -1 reward for every step the agent is not at the top and the episode terminates at 200 steps. Therefore the agent will receive a -200 reward unless it learns to reach the top in under 200 moves [50].

At first glance these as all these tasks contain a dense extrinsic reward signal as they produce a non neutral at each step, so one might expect that they should be solved easily. However, under further inspection in Acrobot and MountainCar tasks, the rewards contain very little information about the state-action value, as all actions are considered equally bad regardless of how close the agent is to the optimal behaviour. An agent must sufficiently explore the state space in order to achieve the a higher reward, hence learn from it.

5.1.2 Atari 2600

For the Atari 2600 games, OpenAI provides wrapper to the Arcade Learning Environment (ALE). The ALE produces a 210x160x3 RGB image for each frame, however many frames are duplicates or are very similar so it has been a common practise since [6] to only consider the nth successive frames, repeating the same action n times. This helps reduce the computational cost by concatenating n steps in the MDP onto a single step. Another preprocessing step performed is to convert the images to greyscale and scale and crop the image to a size 84x84. The final preprocessing step for the Atari environments is to concatenate the current frame with the previous k frames into a single image. This solves the partially observable MDP problem, providing additional temporal information into the state (such as velocity of the ball in pong) that would not be present in a static frame. This allows the use of convolutional neural networks to be used as the non-linear state-value function approximator.

As the environment is deterministic a random amount of ‘no-op’ (neutral) actions were taken at the start of each episode, this changes the initial starting state, and is done to prevent the agent ‘memorising’ good actions from frequently visited target states. Sometimes an loss of life, done or terminal flag is used to increase the agent’s learning efficiency and, performance. This means when bootstrapping for the estimate of the expected future return, a loss of life is treated as the terminal state. The reasoning being that using the flag will allow the agent to quickly learn losing a life is to be avoided. Another commonly used practise is to cap the maximum number of steps an agent can take in the environment before the episode terminates. This penalises idle behaviour, thus should make the agent more efficient. The exact environment parameters are given in 7.1

In SpaceInvaders a series of waves of aliens zig-zag down the screen firing at the agent, the agent’s goal is to shoot the aliens before they reach a certain threshold height. The game increases in difficulty as the aliens increase their speed of descent as their numbers go down. a life is lost when the aliens reach the threshold and the game is ended when all 3 lives are lost or the player has defeated all waves of aliens.

In the game of Freeway the goal is for the player to reach the other side of the road without being hit by the oncoming traffic. The player receives a reward of +1 when reaching the other side and 0 otherwise. In the game MontezumaRevenge the goal is to explore temple rooms and collect amulets. Navigating the different rooms requires the collecting objects such as keys found throughout the level, as well as stringing together a series of complex movements to avoid falling and traps.

The games were chosen for varying sparsity of rewards as according to [46], SpaceInvaders provide a human-optimal environment with relatively dense reward feedback and Freeway and MontezumaRevenge provide the sparse reward environments with easy and difficult search respectively.

Chapter 6 Implementation and Analysis

6.1 Software and Hardware

Python was the language of choice for implementing these algorithms as it is widely used in the reinforcement learning community, such as OpenAI’s baselines [40] allowing for better troubleshooting and debugging. Also, Python is one the most common used API for neural network packages, such as the popular Tensorflow and Pytorch packages. All neural networks were implemented using Tensorflow’s python API. Tensorflow is a Machine Learning framework that allows users to build graphical models using ‘tensors’, Tensorflow’s internal representation of a generalised matrix 11 1 Note that these tensors are distinct in notation and properties from the tensors in tensor calculus [51], and provides built-in automatic differentiation, which automatically handles gradient calculation and backpropagation. And all the experiments were ran on either a single machine with a 8-core, 16-thread AMD RyzenTM 1800x with a single NVIDIA® GeForce® RTX 2070 GPU and 48GB of DDR4 RAM, or, on a single GPU-node on the Sheffield Advanced Research Computer (ShARC) high performance computing system, which contained a NVIDIA® Tesla K80 with a Intel® XeonTM E5-2630 v3 CPU. All references to computational speed and timing are those observed using the single machine setup.

6.2 Scalable Reinforcement Learning

In order to perform the Atari experiments over 200 million frames, it is crucial that the implementation use multiple workers to run the environments, these methods scale number of steps taken in the environment per second sub linearly. Many different scalable frameworks were attempted such as A3C and IMPALA [38], however due to Tensorflow’s incompatibility with python’s multiprocessing library these methods required multithreading, or using Tensorflow’s distributed training framework. Due python’s Global Interpretable Lock, multithreading is not truly parallel, only concurrent and attempted implementations using Tensorflow’s distributed training for a single machine were relatively slow compared to results shown in the literature [38]. The A2C framework, however was observed to increase the number of steps per second by a factor of 10, from 250 to 2300-2700 steps per second. The A2C framework reduces the need or multiple copies of the network by running a single step in multiple environment in parallel. The implementation of running multiple environments synchronously in parallel was adapted from OpenAI’s implementation [40], and achieved this by running an individual environment in a separate process using python’s multiprocessing library. Having the environments run synchronously reduces the overhead of passing gradients between processes as well as the need to account for off-policy corrections. However, the most crucial benefit is the increased efficiency of the use GPU when using a single machine. This is as value estimations can be ran in a single batch decreasing the time spent of transferring data onto the GPU, which is a large bottleneck for GPU applications.

6.3 Encoder Neural Networks

All implementations used neural networks to encode the state st onto a high dimensional feature vector, which then would be used to predict the value function of policy distribution from. In order to be consistent across experiments, every agent used the same encoder networks architectures.

For the Classic Control tasks the encoder network was two MLP layers of with 64 hidden units each, under the ReLU activation function.

For the Atari experiments a CNN encoder network was used, as in [13], as it is a commonly used to benchmark to test different algorithms on. This CNN is parametrised as following, first a 2D convolutional layer with filter size of [8,8] of with 32 output channels and stride [4,4], following a 2D convolutional layer with filter size [4,4], 64 output channels and stride [2,2]. A final convolutional layer with 64 output channels, filter size [3,3] and stride [1,1]. The resulting convolutional layer was flattened to a vector and a final fully connected MLP layer with 512 hidden units. All layers were under the ReLU activation function, and all convolutional layers had the ’VALID’ padding. The observational input range for the policy CNN encoder is scaled from [0,255] to [0,1] for each pixel by diving by 255 to avoid to large ReLU activations as it is unbounded in the positive direction equation 3.9.

Both encoder networks use the Glorot uniform weight initialisation and zero bias initialisation.

6.4 A2C

For the Advantage-Actor Critic A2C network, The actor and the critic share the same encoder network with the value function and unnormalised log probabilities (logits) being linear projections (via a fully connected MLP layer with no activation) from the encoded feature vector. To get the policy distribution π, the softmax activation function was applied to the logits, resulting in a policy distribution over the discrete set of available actions. The gradient of the entropy over the policy distribution βHθH(π(s,θ)) is added to the loss function following [37] in order to provide regularisation of the policy such that is doesn’t converge on a sub optimal policy. By maximising the entropy, the uniformity of the probability distribution increases, this should increases exploration as previously unlikely actions are sampled with greater frequency. The strength of the entropy maximisation is controlled by the parameter βH [37].

6.5 Synchronous DDQN

For the Synchronous n-step DDQN network, the state-action value function was a linear projection from the encoder network. Here double Q-learning was used to reduce maximisation bias.

6.6 PPO

The PPO implementation uses the exact same model as the A2C model mentioned above. A separate network for the old policy parameters θold is not used, instead the policies for each rollout are stored and then passed into the network during backpropagation.

6.7 Unsupervised Auxiliary tasks

6.7.1 Original Implementation

A3C Policy

In [45] the A3C agent is parametrised by an encoded CNN-LSTM network, which takes in an RGB colour image of size [84,84] as it’s input state st for time t. It then encodes this onto a feature vector using a small two layer CNN following the original DQN paper implementation [6]. The CNN-encoded state is concatenated with the previous reward and action is then fed into a LSTM layer with 256 hidden units, from this the policy and value function are projected as described in the A2C implementation.

Pixel Control

For The pixel reward for each time step the image at time t is cropped to a the central region of size 80×80 pixels, and subdivided into 20×20 cells of size 4×4. The pixel reward for time t is then defined by the absolute difference between the mean over all colour channels and all pixels for each cell, of the input image st and the previous image st-1.

In order to produce the Qaux estimate of expected pixel change from each subdivided cell region, a MLP layer with relu is added the LSTM output layer, this is then reshaped to size 32×7×7. A deconvolutional layer with 32 output channels, filter size [3,3] and stride one produces a 32×9 feature map. Two deconvolutional layers with output channels 1, number of actions are applied to the 9×9 feature map concurrently, both with filter size [2,2]. This produces a value function and advantage function estimate which is used to produce the Qaux estimate via duelling neural networks given by [45]

Qaux(s,a|θ)=ReLU(V(s|θ)+A(s,a|θ)-1|𝒜|a𝒜A(s,a|θ))[52]. (6.1)
Reward Prediction

For the reward prediction a reward rt is sampled from the experience replay in a skewed manner such that the probability mass for non-zero rewards is equal 0.5. The three proceeding images [st-2,st-1,st] 22 2 as rt is reward from stst+1 are passed into the CNN encoder, the flattened output features across all three images are concatenated into a single feature vector. This is then passed into a MLP layer with 128 hidden units with the ReLU activation function. This is then projected via another MLP layer onto a categorical distribution using the softmax function. The categories of this distribution are positive negative and neutral rewards, which should be easier to learn than the exact value of rt, this gradient is calculated via cross entropy loss.

Value Replay

A small experience replay buffer stores the most recent 2000 states for each actor, the gradient of the value function replay task is calculated via mean squared error in the same fashion as the base A3C agent.

6.7.2 Custom A2C-CNN implementation (UNREAL-A2C2)

A2C Policy

In order to be a more accurate comparison to the base A2C agent and for greater GPU utilisation, an CNN-only A2C version of [45] is implemented. In order to get this to work the following changes were made, first the states are converted to greyscale and stacked to the match base A2C and Sync-DDQN implementations. In order to be more comparable to the A2C experiments the action-reward concatenation to the encoded state before the value and policy was not performed, as this provides extra temporal information that may improve performance.

Pixel Control

The pixel control reward at each time step is then altered to be defined as the absolute mean difference across the greyscale subdivided region of the most recent image xt in the concatenated state st=[xt-3,xt-2,xt-1,xt]


Other changes, made to the pixel control task were to not perform the cropping operation and subdivide the image into 21×21 cells. The only change this requires to the deconvolutional network is to increase the number of size of the MLP layer before the first feat map from size 32×7×7 to 32×8×8. Finally the greyscale images were normalised between the values of (0,1) using a simple min-max normalisation [53] in order to avoid manually fitting the λPC term per environment. However, this does introduce a small non-stationarity affect into calculating the pixel rewards as the maximum pixel value of the dataset may increase over the course of training. This is less pronounced for than it would be for per-channel RGB min max normalisation, as well as being more stationary that per-pixel z-score image normalisation. This is because in episodic video games tasks many start in a specific location, and when progressing to different areas often means a change in environment textures, thus possibly changing the normalisation parameters significantly. Whereas the change in maximum pixel intensity for a greyscaled images is expected to be negligible.

Reward Prediction

The reward prediction task is the same as the original implementation as described above but, without the frame concatenation as the state already provides necessary temporal information. Instead of sampling rewards from all workers, the worker with the highest reward is greedily chosen to reduce the chance of sparse rewards going unmissed, as well as increasing computational efficiency.

Because of the noticeable modifications to the original UNREAL agent’s algorithm, to distinguish between the two algorithms, the custom agent used is hence referred to as the UNREAL A2C-CNN or UNREAL-A2C2 agent.

6.8 RND

6.8.1 Intrinsic Reward and Value Function

Following [14], The base policy is a PPO policy with an additional separate critic is used to calculate the value of the intrinsic rewards Vi(s). The weighted combination of the extrinsic and intrinsic advantages is used to update the policy, and the discount factor for the extrinsic reward is increased to 0.999 as to match [14]. The intrinsic rewards are non-episodic, as [15] argues that this increases exploration, and they are normalised so that the scale is independent of the observational space [14]. This is achieved by dividing the intrinsic reward by the running estimate of the standard deviation of the intrinsic return. However the intrinsic rewards are non-episodic so the true return cannot be calculated as it goes on indefinitely. It is also non-stationary as it is produced by a changing parametrised function, so using the intrinsic value estimate to approximate the return is suspected to be an unstable normalisation technique. So the normalisation used in [54] is used, where the intrinsic reward is normalised by the standard deviation of the inversely discount rewards as the past rewards are calculable. This can be explicitly defined as,


In practise the running estimate of the standard deviation is updated after every n-step rollout and intrinsic return is calculated online.

6.8.2 Target and Predictor networks

As the target network is fixed, it cannot adjust to the scale of the observational inputs, thus normalisation techniques are required in order ensure the output scale is consistent across different environments [14]. The input to the predictor and target networks is the final image xt in the state st=[xt-3,xt-2,xt-1,xt], this is then normalised by subtracting the running estimate of the per-pixel mean and dividing it by the running estimate of the per-pixel standard deviation and finally being clipped between the range -5, 5. However, This introduces non-stationarity to target network output thus further non-stationarity to the intrinsic reward. The running estimates of the observational mean and standard deviations are initialised by running a random agent for 50 rollouts for each actor and continually updated during training.

The target and predictor network are both separate models from the policy as to ensure bias free reward prediction, they both use the same convolutional encoder architecture as the policy network.

The target network has a final MLP layer of size 512 with no activation after the encoder. For the predictor network after the convolutional encoder it has two successive MLP layers of size 512, both with the ReLU activation followed by a final MLP layer with size 512 with no activation. The target and predictor networks use a leaky ReLU activations for the convolutional layers, as well as all layers being initialised with orthogonal initialisation with gain 2. This was not done to increase performance but to ensure the output of the target and prediction networks were of the same magnitude as the [54] implementation, so that the hyperparameters taken from [14] are valid.

6.9 ICM

For curious agents using the next-state prediction, the ICM model is used to provide the additional reward for the PPO policy with separate intrinsic value function. The ICM is parametrised by additional separate encoder network using the same architecture and weight initialisation as the base policy encoder, but replacing the observational scaling with the observational normalisation as described in the RND implementation. The output from the encoder ϕ(st) is concatenated the with action at and fed into the forward model. The forward model consists of one fully connected MLP layer with ReLU activation, with the number of hidden units the same size of the encoded state, i.e. 512 and 64 for the CNN and MLP encoder respectively. A following MLP with no activation and same size of the previous layer creates the predicted next state ϕ^(st+1) [8].

For the inverse model the encoded state and next state ϕ(st) and ϕ(st+1) are concatenated into a single feature vector and passed into a MLP layer with ReLU activation and size 512 and 64 for the CNN and MLP respectively. This is fed into a final MLP layer with no activation that maps onto the unnormalised logit probability distributions over the available actions producing at^. The inverse model is trained as a classification task using cross entropy between the predicted action at^ and actual action at taken at time t, as well the forward model being trained via mean squared error |ϕ^(st+1)-ϕ(st+1)|2 [8]. Unlike the [15] implementation the forward model and the inverse model gradients are both passed back to the encoder model as in the original implementation [8], with weighting βLforward+(1-β)Linverse, with β=0.2.

6.10 Random Network Distillation with Auxiliary Learning RANDAL

RAndom Network Distillation with Auxiliary Learning (RANDAL) is a novel method of combining curiosity driven exploration with RND and auxiliary tasks was implemented as following. A base PPO policy with RND is combined with all auxiliary tasks as described in UNREAL-A2C2 implementation, in order to stop the intrinsic reward from decreasing too rapidly and thus exploring less, a separate intrinsic value function is used as in the RND agent. This allows the value replay be performed exclusively on the extrinsic value function, preserving the rate of change in it. As the intrinsic reward is non-stationary, the reward prediction is done only on the sign (+,-,0) of the extrinsic reward, this is done in hopes to increase stability as extrinsic rewards are stationary as well as being able to continue to use the simplified categorical reward rather than the exact reward value. The target and predictor networks are identical to the RND implementation.

6.11 Analysis

For the baseline architectures of Sync-DDQN and A2C, it is expected that A2C agents shall outperform Sync-DDQN agents as it is a policy gradient method. As previously mentioned policy gradient methods provide are theorised to have more stable convergence properties, thus should find a more optimal than the value based Sync DDQN agents. However, the A2C algorithm is expected to perform worse on sparse reward tasks as it suffers for poor exploration, e.g. A2C will probably never learn to navigate MountainCar, Freeway or MontezumaRevenge. Again this is because it is an on-policy algorithm using the estimate of the optimal policy π(a|s,θ) to sample actions from each turn. The expectation of the Sync DDQN agents is that they should slightly underperform compared to the A2C for dense reward tasks but should perform better for tasks that benefit from random exploration, i.e. should outperform A2C on easy search sparse reward tasks, for the chosen environments this includes the Freeway, Acrobot and MountainCar tasks.

PPO agents should outperform A2C and Sync-DDQN agents on all tasks, as the multiple epoch minibatch training should ensure sparse rewards do not get lost in the gradient update, as well as the value function estimate converging to the true value of the policy quicker than the single update algorithms. This reiteration of the value function is comparable to the value function replay in the UNREAL architecture, thus the performance of the PPO agent should be closer to the UNREAL agents than the A2C and Sync-DDQN agents.

For the curious agents, the expectation is that the ICM and RND agents should perform equally well for high dimensional complex inputs, such as image observations in Atari games. This is because the forward model will not converge too quickly, providing similar amounts of rewards as the RND agents. For low dimensional non-complex environments, ICM agents are expected the magnitude of the intrinsic rewards is expected to decrease quickly as the observations are easily predictable, even for unseen states.

Chapter 7 Results and Discussion

7.1 Testing and Validation

For the Atari experiments the number of frames skipped (actions repeated) was 4, meaning each environment step is equivalent to 4 frames. The number of frames stacked per state was 4, the experiments were ran over 50 million steps, or 200 million frames as according to [6] and the episode terminated after 4500 steps as to match [14]. The number of ‘no-op’ actions was a random number between 0 and 7 steps, or 0-28 frames. The agent was validated every 1 million steps and the score was averaged over 50 episodes. During validation the episode was terminated after 10,000 steps approximately 11 minutes of real-time play to avoid games such as MontezumaRevenge carrying on indefinitely if the agent is stationary.

For the Classic Control experiments the ran over 2 million steps, the agent was validated following the same procedure as the Atari experiments except validated every 40,000 steps.

7.2 Atari Experiments

Due to the computational complexity and limited time of this project, no hyperparameter search was completed for the Atari domains, and the hyperparameters were taken from existing work in the literature. All agents use a gradients calculated with a clipping by normalisation value of 0.5 [55] to help stabilise gradient updates. All figures show the mean score over 3 random starts, ± the standard deviation of the runs.

7.2.1 Baselines

The selected environments were all run for both the Synchronous n-step Double DQN and the A2C architectures as to provide a baseline to compare the more advanced algorithms to. As [43] and [14] uses Adam optimisation, in order for the baselines A2C and Sync-DDQN to be a fairer comparison, Adam is used for both of these agents, with ϵ=1×10-8 and βm=0.9,βv=0.999. Although, RMSProp is used in the original implementation [37].
Hyperparameter Value Optimiser Adam Learning rate 1×10-3 Number of actors 32 Entropy coefficient 0.011 Value coefficient 0.52 n-step period 51 Discount factor γ 0.991 Gradient norm clipping 0.52 Table 7.1: A2C Atari baseline hyperparameters, taken from 1[37], 2[40] Hyperparameter Value Optimiser Adam Learning rate 1×10-3 Number of actors 32 Target Network Period 10,000 steps 3 Initial ε 11 Final ε 0.011 εtest 0.011 n-step period 51 Discount factor γ 0.991 Gradient norm clipping 0.52 Table 7.2: Synchronous Double DQN Atari baseline hyperparameters, taken from 1[37], 2[40], 3[41]

In the Atari experiments for the value based Double DQN the exploration rate ε was linearly annealed over 2 million steps to provide sufficient exploration.

Figure 7.1: Atari Baselines, results show a mean validation score ± standard deviation over 3 random parameter initialisations run of the respective models for the number of steps in millions. The figure shows that the policy gradient A2C leads to better convergence for dense reward tasks however suffers from poor exploration and thus cannot learn to solve sparse reward tasks. This is contrasted against the Sync-DDQN which can solve the easy search sparse task Freeway with random search, however fails to learn from the more complex MontezumaRevenge sparse reward task.

7.2.2 λ-Return and PPO

In order to better connect the future rewards to the value of the current state V(st), the n-step period was increased to 20 to match the [45] implementation, and to avoid having to manually fit the n-step period for each experiment the λ-return from equation 2.16 and GAE from equation 2.27 were used for the value-based and policy based methods respectively, with a value of λ=0.95 taken from [22]. For the PPO experiments, the n-step period was further increased to 128 to match the value used in [15] and [14].

Hyperparameter Value
Optimiser Adam1
Learning rate 1×10-4
Number of actors 32
Entropy coefficient 0.011
Value coefficient 0.5
n-step period 1282
Number of epochs 42
Number of minibatches 42
Discount factor γ 0.991
PPO clip range [0.9,1.1]2
Gradient norm clipping 0.53
Table 7.3: PPO Atari hyperparameters, taken from 1[43], 2[14] 3[40]
Figure 7.2: Atari experiments for PPO with GAE vs A2C with GAE vs Sync-DDQN(λ). The combination of increased n-step period and use of λ return, significantly increase performance in the dense reward environment over the initial TD(5) return however, show no performance gains for the sparse reward tasks. Surprisingly the PPO agent does not outperform the A2C and Sync-DDQN agents on the easy search sparse reward task, where it was expected to outperform on all three tasks according to the results obtained in [43].

7.2.3 Sparse Reward Solutions

The three sparse reward solutions being tested are, the UNREAL-A2C2 agent, the RND with PPO, and the novel RANDAL with PPO. All agents use the respective n-step period from their base policies in section 7.2.2 and all use GAE to improve performance. Due to the significantly slower algorithm implementation and time constraints of the project, a curiosity driven agent using the ICM was not tested for the Atari experiments.

Hyperparameter Value Optimiser Adam1 Learning rate 1×10-4(1) Number of actors 32 Entropy coefficient 0.0011 Value coefficient 0.5 n-step period 1281 Number of epochs 41 Number of minibatches 41 Extrinsic advantage coefficient 2.01 Intrinsic advantage coefficient 1.01 Extrinsic discount factor γe 0.9991 Intrinsic discount factor γi 0.991 PPO clip range [0.9,1.1]1 Gradient norm clipping 0.5 Table 7.4: RND Atari hyperparameters, taken from 1[14] Hyperparameter Value Optimiser Adam Learning rate 1×10-3(1) Number of actors 32 Entropy coefficient 0.0011 Value coefficient 0.51 n-step period 201 Discount factor γ 0.991 Gradient norm clipping 0.5 Reward prediction coefficient   11 Value replay coefficient   11 Pixel Control coefficient 1 Replay length per actor 20001 Table 7.5: UNREAL-A2C2 Atari hyperparameters taken from 1[45]
Hyperparameter Value
Optimiser Adam1
Learning rate 1×10-4
Number of actors 32
Entropy coefficient 0.001
Value coefficient 0.5
n-step period 128
Number of epochs 4
Number of minibatches 4
Extrinsic advantage coefficient 2.0
Intrinsic advantage coefficient 1.0
Extrinsic discount factor γe 0.999
Intrinsic discount factor γi 0.99
PPO clip range [0.9,1.1]
Gradient norm clipping 0.5
Reward prediction coefficient 1
Value replay coefficient 1
Pixel Control coefficient 1
Replay length per actor 2000
Table 7.6: RANDAL Atari hyperparameters
Figure 7.3: Atari experiments for RND vs UNREAL-A2C2 vs RANDAL agents. All agents are superior in all environments over the traditional reinforcement learning agents and are equal in the dense reward task. However, the RND and RANDAL excel in the sparse reward tasks significantly outperforming the UNREAL-A2C2 agent in both the easy and hard search, sparse reward environments. RANDAL shows a less stable mean score than the RND, with large spikes in performance in successive validation scores, however shows a lower variance between the runs for both sparse reward tasks.

The UNREAL-A2C2 agent shows increased sample efficiency over the original implementation [45, Figure 6] for MontezumaRevenge achieving the score of 500 at around 25 million steps as opposed to 50 million in [45, Figure 6]. However, the UNREAL-A2C2 agent’s performance plateaus at this score whilst the original implementation increases approximately linearly. The difference between these two agents could due to one of many changes. As using GAE is shown to generally increase performance [22], the suspect list is narrowed down to the inclusion of the previous action and reward to the policy as well as the use of a LSTM network.

A clear parallel between these two methods can be seen when contrasting them from a
model-based/model-free viewpoint. Both RND and UNREAL agents can be said to utilise environment models to increase performance, as UNREAL agents learn a reward model p(rt|st), unconditioned on the action i.e.

p(rt|st)=a𝒜p(a|st)(rt|st,a)=a𝒜p(a|st)sa (7.1)

and RND agents learn a random feature model of the environment which can be said to be learning a simplistic density model of regions explored by the agent. This interplay between model-based and model-free reinforcement learning, seems to show increased sample efficiency and performance over purely model-free agents, again this is empirically supported by [56] as well as in Figure 7.3. This theory is also backed by biological plausibility as model-based learning is said to be ‘ubiquitous’ in animal brains [57].

Qualitative analysis of trained agents show that all sparse reward solutions explore the same 3 rooms in MontezumaRevenge, and the difference in the score is that the RND and RANDAL agents learn to use a collected sword to kill any enemy resulting in a large score increase. An interesting observation was that after the agents had reached the maximum total score, that they were capable of achieving, the RND agent aimlessly wanders the surrounding rooms and the UNREAL-A2C2 and RANDAL agents spend the remainder of the episode transitioning between two rooms. These can be explained as the RND agent is trying to maximise the intrinsic reward for the rest of the episode and the UNREAL-A2C2 agent being ‘interested’ in the rooms transitions as they provide large pixel changes. The RANDAL agent shows a mix between these two behaviours, first moving back to the first room then for the remainder of the episode, it transitions between two rooms.

7.3 Classic Control Experiments

For the Classic Control Experiments the figures show the mean score ± the standard deviation for 5 different random starts.

7.3.1 Adam Hyperparameter Search

Initial results from the Classic Control environments using hyperparameters from Table 7.1 and Table 7.2 and Adam optimisation, showed relatively poor performance, so a small hyperparameter search was completed. For the A2C algorithm a random search was performed for 50 combinations of the entropy and learning rate with different random starts. The hyperparameters were sampled from a LogUniform(10-3,1) and LogUniform(10-5,10-2) for the entropy and learning rate respectively.

Figure 7.4: Classic Control A2C hyperparameter search, results show the Adam learning rate vs entropy with the colour representing the average of the last 5 validation scores of the agent, as this includes a measure of stability of the agent. It can be seen that learning rates that are optimal for both CartPole and Acrobot lie between the values of 2×10-4 and 4×10-4 and optimal entropy lies between 5×10-3 and 2×10-2.

The learning rates in Figure 7.4 have a larger affect on performance than the entropy coefficient, this is more pronounced in the Acrobot results where a large range of entropy values provide good performance. However even large entropy regularisation wasn’t enough to for any the agents sampled to achieve any score in the sparse reward MountainCar task.

For the Classic Control Sync-DDQN agents, the exploration rate was linearly annealed over 80,000 steps, as to be proportional to the Atari experiments i.e. linearly annealed over 1/25th of training. These were done to try to keep the hyper-parameters consistent or proportional across all tasks as to reduce the number of variables that could be affecting the results. The agents are validated using a εtest of value 0.01 for all Sync-DDQN agents. The learning rate is sampled from a LogUniform(5×10-5,10-2) distribution and the final ε is chosen to be either 0.1 or 0.01 with equal probability.

Figure 7.5: Classic Control Sync-DDQN hyperparameter search, results show the Adam learning rate vs final ε value over 30 samples, with the colour representing the average of the last 5 validation scores of the agent. The results are less clear than 7.4, however it can be seen that both the CartPole and MountainCar agents suffer from poor exploration, as higher performance is seen for the higher final ε value for CartPole and the MountainCar agents perform poorly, when they were expected to be able solve the task through sufficient random search.

7.3.2 Adam vs RMSProp

For the A2C agent the learning rate and entropy coefficient were selected from the results shown in Figure 7.4, to perform well over both the CartPole and Acrobot environments. These values were selected to be 2×10-4 and 0.005 for the learning rate and entropy coefficient respectively.

For the Sync-DDQN agent the only values shared that perform well in Figure 7.5 for both CartPole and Acrobot. This is where the learning rate of 2×10-4 and final exploration value ε=0.1. As mentioned In Figure 7.5, the annealing the learning rate over 80,000 steps does not seem to provide enough exploration, so the number of steps were increased, so that the ε was annealed from 1 to 0.1 over half of training i.e. 1 million steps.

The initial optimised Adam results, showed large dips in performance around the 500,000 step mark for both CartPole and Acrobot experiments, where the gradients have significantly overshot a local maximum, causing performance to drastically decrease. This was expected to be due to the momentum from the Adam optimiser causing the weights to be ‘slingshot’ further from the global maxima. To test this, the RMSProp optimiser which has no momentum was used to compare the results, for the RMSProp agents the learning rates and entropy coefficient hyperparameters taken from the literature, as seen in Table 7.1 and Table 7.2, and the ϵ=1×10-5,βv=0.9

\thesubsubfigure Optimised Adam Agents
\thesubsubfigure RMSProp Agents
Figure 7.6: Classic Control baselines Adam vs RMSProp for A2C and Sync-DQQN agents. The results show that for low dimensional RL problems, RMSProp optimisation is significantly more stable and has faster convergence than Adam. The results also show that the policy gradient method is more stable than the value based method as the rewards fluctuate less over the course of training, matching up with the theory. However an unexpected result is the MountainCar environment which was expected to be solvable for the Sync DDQN through increased exploration over the initial results in Figure 7.5.

Unlike the in Atari Freeway environment as seen in Figure 7.1, both algorithms are unable to successfully navigate the MountainCar environment in Figure 7.6. Even with sufficient exploration for the Sync-DDQN agents the sparsity of the reward provides it very difficult for the agent to connect the series of actions to the reward received when reaching the target.

7.3.3 λ-Return and PPO

For the Sync-DDQN(λ) and A2C-GAE agents all hyperparameters were kept constant as increasing the step size decreased sample efficiency and performance, due to n=20 being much closer to the length of the episode (200 steps) in the Classic Control environments, compared to the Atari experiments. For simplicity and fairness, the PPO agents share many of the same hyperparameters of the A2C agents, and can be seen in Table 7.7

Hyperparameter Value
Optimiser RMSProp
Learning rate 1×10-3
Number of actors 32
Entropy coefficient 0.01
Value coefficient 0.5
n-step period 5
Number of epochs 4
Number of minibatches 1
Discount factor γ 0.99
PPO clip range [0.9,1.1]
Gradient norm clipping 0.5
Table 7.7: PPO Classic Control hyperparameters
Figure 7.7: Classic Control A2C-GAE vs Sync-DDQN(λ) vs PPO. The results show that the GAE slightly decreases sample efficiency for the A2C Acrobot agents, and Sync-DDQN CartPole agents. PPO, show greater sample efficiency and stability than all A2C and Sync-DDQN agents, achieving a perfect score from 0.5 million steps onwards for the CartPole environment.

The results from Figure 7.7 can be explained by the Rλ and GAE, being more complex functions to model, thus slightly decreasing the learning speed of the agents.

7.3.4 Sparse Reward Solutions

Due to the nature of the pixel control task, the UNREAL-A2C and RANDAL agents only use the value replay and reward prediction tasks. All PPO-based agents (RANDAL, RND, ICM) use the same base policy hyperparameters as the Classic Control PPO agents seen in Table 7.7, with the γe and γi and other algorithm specific hyperparameters from Tables 7.4 and 7.6.

Figure 7.8: Classic Control experiments for RND vs UNREAL-A2C2 vs RANDAL agents. The results surprisingly show no significant gain over the PPO agents, and the UNREAL-A2C2 Acrobot results show the importance of the affect that bad initialisation can have on deep RL agents.

Surprisingly Figure 7.8, shows that auxiliary reward prediction and curiosity driven exploration are not enough for agents to sufficiently explore the MountainCar environment, this could be due to the normalisation techniques in curiosity agents or the intrinsic rewards are not enough to incentivise the agent to explore different methods of hill climbing before the parameters get stuck in a suboptimal stationary point.

The CartPole and Acrobot environments are clearly limiting benchmarks, easily solved by traditional agents, thus providing little information on the effectiveness on more complex algorithms. The MountainCar environment provides a difficult benchmark to reach for neural network parametrised agents. This is as neural networks often work better on higher dimensional data and require large amounts of data to converge compared to other models. The frequency of the reward is too small for these agents to sufficiently learn, another explanation is that the neural network agents are being stuck in a steep local stationary point where even after experiencing positive reward the update is not large enough to escape this point. One way to solve this problem would be to use a form of prioritised experience replay with large initial random exploration, to ensure the agent experiences enough rewards earlier on on the training.

Chapter 8 Conclusions and Further Work

Large performance gains in deep learning have often been attributed to increased computational power, however fundamental difficulties of sparse reward reinforcement learning have not been solved brute strength so far. Instead it has often relied on more intelligent exploration and exploitation algorithms, inspired by solutions found in nature, as reinforcement learning has had significant input from the fields of neuroscience and psychology.

This report has shown that the combination and interplay of model-based and model-free reinforcement learning techniques in non-traditional ways, often increases sample efficiency and performance. This report has also shown that specifically finding reinforcement learning solutions that work well across multiple sparse reward tasks, generally increases performance across all tasks.

8.1 Further Work

With the increased in performance of the UNREAL-A2C2, RND and novel RANDAL agents in sparse tasks, further work could be done to combine different aspects of these algorithms. For the UNREAL-A2C2 further investigation could be done to determine the compare the differences between this implementation vs the original, such as the use of GAE, Adam vs RMSprop optimisers, and the impact the feeding temporal information such as previous actions and rewards.

For the RANDAL agents a hyperparameter search could be done to optimise the weighting of the auxiliary tasks with the intrinsic reward. Combining intrinsic and extrinsic rewards for the reward prediction task is also an avenue for further exploration for the RANDAL architecture.

Following the successes of the interplay between model-based and model-free learning a proposed direction is to combine model-based imagination used in [56] with intrinsic motivation and auxiliary tasks.


  • [1] R. A. Wise and P.-P. Rompre, “Brain dopamine and reward,” Annual Review of Psychology, vol. 40, no. 1, pp. 191–225, 1989. PMID: 2648975.
  • [2] C. Sammut and G. I. Webb, eds., Samuel’s Checkers Player, pp. 881–881. Boston, MA: Springer US, 2010.
  • [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
  • [4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” CoRR, vol. abs/1207.4708, 2012.
  • [5] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al., “Starcraft ii: A new challenge for reinforcement learning,” arXiv preprint arXiv:1708.04782, 2017.
  • [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” CoRR, vol. abs/1312.5602, 2013.
  • [7] R. Dubey, P. Agrawal, D. Pathak, T. L. Griffiths, and A. A. Efros, “Investigating human priors for playing video games,” in ICML, 2018.
  • [8] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in ICML, 2017.
  • [9] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.
  • [10] P. A. Gagniuc, Markov chains: from theory to implementation and experimentation. John Wiley & Sons, 2017.
  • [11] C. C. White, Markov decision processesMarkov decision processes, pp. 484–486. Boston, MA: Springer US, 2001.
  • [12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, second ed., 2018.
  • [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
  • [14] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” CoRR, vol. abs/1810.12894, 2018.
  • [15] Y. Burda, H. Edwards, D. Pathak, A. J. Storkey, T. Darrell, and A. A. Efros, “Large-scale study of curiosity-driven learning,” CoRR, vol. abs/1808.04355, 2018.
  • [16] D. Silver, “Lecture 2: Markov decision process.”
  • [17] R. Bellman et al., “The theory of dynamic programming,” Bulletin of the American Mathematical Society, vol. 60, no. 6, pp. 503–515, 1954.
  • [18] E. Vasilaki, “Lecture notes in adaptive intelligence,” Department of Computer Science, University of Sheffield, 2019.
  • [19] D. Silver, “Lecture 4: Model free prediction.”
  • [20] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, pp. 279–292, May 1992.
  • [21] H. V. Hasselt, “Double q-learning,” in Advances in Neural Information Processing Systems 23 (J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, eds.), pp. 2613–2621, Curran Associates, Inc., 2010.
  • [22] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [23] R. Morris, “D.o. hebb: The organization of behavior, wiley: New york; 1949,” Brain Research Bulletin, vol. 50, no. 5, p. 437, 1999.
  • [24] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., “Learning representations by back-propagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1, 1988.
  • [25] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991.
  • [26] Y. LeCun, “A theoretical framework for back-propagation,” in Proceedings of the 1988 Connectionist Models Summer School (D. Touretzky, G. Hinton, and T. Sejnowski, eds.), (CMU, Pittsburgh, Pa), pp. 21–28, Morgan Kaufmann, 1988.
  • [27] W. Hossack, “Fourier transform,” 2019.
  • [28] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, “Chapter 10 - deep learning,” in Data Mining (Fourth Edition) (I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, eds.), pp. 417 – 466, Morgan Kaufmann, fourth edition ed., 2017.
  • [29] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323, 2011.
  • [30] C. E. Shannon, “A mathematical theory of communication,” Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [31] J. Shore and R. Johnson, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy,” IEEE Transactions on Information Theory, vol. 26, pp. 26–37, January 1980.
  • [32] J. Shore and R. Johnson, “Properties of cross-entropy minimization,” IEEE Transactions on Information Theory, vol. 27, no. 4, pp. 472–482, 1981.
  • [33] C. D. Sa, “Non-convex optimization.”
  • [34] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., “Learning representations by back-propagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1, 1988.
  • [35] G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learning lecture 6a overview of mini-batch gradient descent,” Cited on, vol. 14, p. 8, 2012.
  • [36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [37] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016.
  • [38] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, “IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures,” CoRR, vol. abs/1802.01561, 2018.
  • [39] Y. Wu, E. Mansimov, S. Liao, R. B. Grosse, and J. Ba, “Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation,” CoRR, vol. abs/1708.05144, 2017.
  • [40] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “Openai baselines.”, 2017.
  • [41] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” CoRR, vol. abs/1509.06461, 2015.
  • [42] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015.
  • [43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
  • [44] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” CoRR, vol. abs/1710.02298, 2017.
  • [45] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” CoRR, vol. abs/1611.05397, 2016.
  • [46] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” CoRR, vol. abs/1606.01868, 2016.
  • [47] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” CoRR, vol. abs/1707.01495, 2017.
  • [48] “Cartpole-v1 documentation.” Accessed: 2019-08-25.
  • [49] “Acrobot-v1 documentation.” Accessed: 2019-08-25.
  • [50] “Mountaincar-v0 documentation.” Accessed: 2019-08-25.
  • [51] S. Laue, M. Mitterreiter, and J. Giesen, “Computing higher order derivatives of matrix and tensor expressions,” in Advances in Neural Information Processing Systems, pp. 2750–2759, 2018.
  • [52] Z. Wang, N. de Freitas, and M. Lanctot, “Dueling network architectures for deep reinforcement learning,” CoRR, vol. abs/1511.06581, 2015.
  • [53] S. Aksoy and R. M. Haralick, “Feature normalization and likelihood-based similarity measures for image retrieval,” Pattern Recognition Letters, vol. 22, no. 5, pp. 563 – 582, 2001. Image/Video Indexing and Retrieval.
  • [54] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Random network distillation.”, 2019.
  • [55] “Clip by normalised gradient.”, 2019.
  • [56] G. Kalweit and J. Boedecker, “Uncertainty-driven imagination for continuous deep reinforcement learning,” in Proceedings of the 1st Annual Conference on Robot Learning (S. Levine, V. Vanhoucke, and K. Goldberg, eds.), vol. 78 of Proceedings of Machine Learning Research, pp. 195–206, PMLR, 13–15 Nov 2017.
  • [57] “The ubiquity of model-based reinforcement learning,” Current Opinion in Neurobiology, vol. 22, no. 6, pp. 1075 – 1081, 2012. Decision making.