Abstract
Reinforcement learning has seen great advancements in the past five years.The successful introduction of deep learning in place of more traditionalmethods allowed reinforcement learning to scale to very complex domainsachieving superhuman performance in environments like the game of Go ornumerous video games. Despite great successes in multiple domains, these newmethods suffer from their own issues that make them often inapplicable to thereal world problems. Extreme lack of data efficiency, together with hugevariance and difficulty in enforcing safety constraints, is one of the threemost prominent issues in the field. Usually, millions of data points sampledfrom the environment are necessary for these algorithms to converge toacceptable policies. This thesis proposes novel Generative Adversarial Imaginative ReinforcementLearning algorithm. It takes advantage of the recent introduction of highlyeffective generative adversarial models, and Markov property that underpinsreinforcement learning setting, to model dynamics of the real environmentwithin the internal imagination module. Rollouts from the imagination are thenused to artificially simulate the real environment in a standard reinforcementlearning process to avoid, often expensive and dangerous, trial and error inthe real environment. Experimental results show that the proposed algorithmmore economically utilises experience from the real environment than thecurrent stateoftheart Rainbow DQN algorithm, and thus makes an importantstep towards sample efficient deep reinforcement learning.
Quick Read (beta)
University of Birmingham
Undergraduate Thesis
Generative Adversarial Imagination for Sample Efficient Deep Reinforcement Learning
Kacper P. Kielak
BSc Artificial Intelligence and Computer Science
Student ID: 1698133
Supervised by
Dr. Per Kristian Lehre
School of Computer Science, University of Birmingham
\monthname[4], 2019
Abstract
Reinforcement learning has seen great advancements in the past five years. The successful introduction of deep learning in place of more traditional methods allowed reinforcement learning to scale to very complex domains achieving superhuman performance in environments like the game of Go or numerous video games.
Despite great successes in multiple domains, these new methods suffer from their own issues that make them often inapplicable to the real world problems. Extreme lack of data efficiency, together with huge variance and difficulty in enforcing safety constraints, is one of the three most prominent issues in the field. Usually, millions of data points sampled from the environment are necessary for these algorithms to converge to acceptable policies.
This thesis proposes novel Generative Adversarial Imaginative Reinforcement Learning algorithm. It takes advantage of the recent introduction of highly effective generative adversarial models, and Markov property that underpins reinforcement learning setting, to model dynamics of the real environment within the internal imagination module. Rollouts from the imagination are then used to artificially simulate the real environment in a standard reinforcement learning process to avoid, often expensive and dangerous, trial and error in the real environment.
Experimental results show that the proposed algorithm more economically utilises experience from the real environment than the current stateoftheart Rainbow DQN algorithm, and thus makes an important step towards sample efficient deep reinforcement learning.
Acknowledgments
I would like to express my great gratitude to Dr Per Kristian Lehre for his invaluable guidance over the past year. His outofthebox suggestion during the planning phase of the project allowed me to put together different ideas that finally led to the work presented in this thesis.
In addition, I would like to thank Dr Lukasz Kaiser from Google Brain for finding the time to answer my questions regarding his recent findings.
Contents
1 Introduction
One of the most prominent dilemmas in the field of artificial intelligence is to produce fully independent agents that learn optimal behaviour and develop over time purely by trial and error interaction with the surrounding environment. A mathematical framework that encapsulates the problem of these autonomous systems is reinforcement learning (RL) (Sutton & Barto, 1998). Although over the past few years exceptional progress has been made in devising artificial agents that can learn and solve problems in a variety of domains using RL approaches (Arulkumaran et al., 2017), these techniques are still not ideal. They require an immense amount of nonoptimal interaction with the real environment before they begin to operate acceptably well and they do not efficiently adapt to new tasks, even within the identical environmental setting (Irpan, 2018).
So far RL researchers were concentrating on mastering games like backgammon, chess, go, or various video games. In these settings dynamics of the environment are either entirely known and thus can be simulated (rules of the board games), or they can be queried and reset infinitely many times without any additional costs (video games). It allowed producing infinite amounts of data for the agent to learn. However, these kinds of conditions are rare in the real world. Dynamics of the environment are usually unknown and are too involved to approximate using rulebased methods. Often, we also cannot let the agent do millions of arbitrary trialanderror live experiments freely.
It is simple to imagine the use of reinforcement learning agent to optimise user experience on the website. Every bad decision in the real environment may result in a loss of an unsatisfied customer. Millions of such choices, before the agent converges to the optimal policy is too big of a risk for any company. Another example can be an autonomous car accustomed to driving in a specific country that suddenly finds itself in another country with a completely different driving culture. A human could quickly adapt to the new reality, but a current stateoftheart reinforcement learning agent would require enormous amounts of experience first, highly increasing a probability of a severe accident.
Modelbased reinforcement learning algorithms promise to solve this obstacle by using known dynamics of the environment to analyse probable scenarios. The agent can imagine various circumstances, and learn or reason based on them, without actually executing expensive trialanderror exploration in the real world. Use of internal forecasts of the world for decision making and reasoning was deeply examined within the neuroscience community (Tolman, 1948; Hassabis et al., 2007; Schacter et al., 2012). It has been demonstrated to exist within the learning process of humans and several animals (Pfeiffer & Foster, 2013; Leinweber et al., 2017). Again, it is manageable from the RL perspective when we fully understand the model of the environment as exhibited by the AlphaGo (Silver et al., 2017). However, as discussed earlier, we do not always know the exact specifics of the environment and, frequently, we have no prior knowledge regarding its dynamics at all.
One plausible solution to that difficulty could be learning the model of the environment instead. Some work has been done already on the subject. Most prominently, Oh et al. (2015) and Leibfried et al. (2016) showed that the dynamics of the environment can be modelled with very high accuracy. Nonetheless, although learned ’imaginative’ model helped to improve outcomes in environments that require longterm planning, it did not significantly reduce the size of the system exposure required for training a wellperforming agent (Racanière et al., 2017). This brings about the following questions:

•
Can learning the imaginative model of the environment be more data efficient than learning an optimal policy?

•
If so, can the learned imagination fulfil the promise of sample efficient modelbased RL in settings where dynamics of the real environment are unknown?
This study subsequently answers both of the questions. It combines a few recent and a few less recent ideas from the field to do so. It hypothesises that recent advancement in the generative adversarial networks architecture (Goodfellow et al., 2014), and inherent to the RL setting Markov property can provide a positive answer to the first question. Furthermore, it considers that the potential use of imagination within the structure similar to the DynaQ algorithm (Sutton, 1990) may indeed profoundly improve sample efficiency of RL in unknown environments.
Following from the introduction, this thesis is structured as follows: Section 2 provides scientific background and basic theoretical fundamentals necessary for full understanding of the conducted research. Section 3 gives an overview of already existing studies that are relevant to this topic. Section 4 introduces novel Generative Adversarial Imaginative Reinforcement Learning (GAIRL) algorithm developed to test the abovestated hypothesis. Section 5 then goes into details of methods that were used to efficiently create accurate imagination, i.e. learn the model of the environment. Section 6 describes an experimental setting used to evaluate newly proposed GAIRL algorithm and compare it to the current stateoftheart. Section 7 presents the qualitative results that are further discussed in section 8. Finally, section 9 summarises the work carried out.
2 Background
2.1 Reinforcement learning
Reinforcement learning is the problem of learning an optimal policy (i.e. behaviour) for a given environment (Sutton & Barto, 1998). RL is formalised by Markov decision processes (MDPs). An MDP can be formulated as a tuple $(S,A,T,R,\gamma )$ where $S$ is a (discrete or continuous) set of possible states, $A$ is a (discrete or continuous) set of allowed actions, $T$ is a transition probabilities function, $R$ is a (deterministic or stochastic) reward function, and $\gamma \in (0;1)$ is a discount factor for future rewards (controls agent’s time preference). At any given time $t$, the reinforcement learning agent observes an environment state ${s}_{t}\in S$ and selects an action ${a}_{t}\in A$. Then, the reward ${r}_{t+1}={R}_{{s}_{t}}^{{a}_{t}}$ is returned from the environment, and the environment moves to the state ${s}_{t+1}\in S$ with transition probability ${T}_{{s}_{t}{s}_{t+1}}^{{a}_{t}}=P({s}_{t+1}{s}_{t},{a}_{t})$. $T$ and $R$ fully describe the dynamics of the environment.
A critical characteristic of MDPs is that it follows the Markov assumption. Namely, at any point in time $t$, history of encountered states ${s}_{0},{s}_{1},\mathrm{\dots},{s}_{t1},{s}_{t}$ can be simplified to the last state ${s}_{t}$ without any loss in information, i.e:.
$$P({s}_{t+1}{s}_{t})=P({s}_{t+1}{s}_{1},\mathrm{\dots},{s}_{t})$$  (1) 
It is an assumption required by RL algorithms. Unfortunately, this property is not always observable in the real world. We can often circumvent its absence by clever preprocessing of the state space $S$ or by translating this lack of information to additional stochasticity on top of the transition and reward functions. However, even with these countermeasures, lack of Markov assumption may limit the range of optimal strategies that our agent can learn.
The agent interacts with an MDP by selecting actions accordingly to the policy $\pi $ that maps states to the probability distribution over A. The goal of obtaining an optimal policy can be formulated as learning a policy ${\pi}^{*}$ that maximises the state value function ${V}^{\pi}:S\to \mathbb{R}$ (Bellman equation):
$${V}^{\pi}(s)=\sum _{a\in A}\pi (as)({R}_{s}^{a}+\gamma \sum _{{s}^{\prime}\in S}{T}_{s{s}^{\prime}}^{a}{V}^{\pi}({s}^{\prime}))$$  (2) 
From which we can also define stateaction value function ${Q}^{\pi}:S\times A\to \mathbb{R}$:
$${Q}^{\pi}(s,a)={R}_{s}^{a}+\gamma \sum _{{s}^{\prime}\in S}{T}_{s{s}^{\prime}}^{a}\sum _{{a}^{\prime}\in A}\pi ({a}^{\prime}{s}^{\prime}){Q}^{\pi}({s}^{\prime},{a}^{\prime})$$  (3) 
There is always at least one optimal policy and it satisfies/entails the following:
$$\underset{\pi}{\forall}\underset{s\in S}{\forall}{V}^{{\pi}^{*}}(s)\ge {V}^{\pi}(s)$$  (4) 
$${V}^{{\pi}^{*}}(s)=\underset{a\in A}{\mathrm{max}}({R}_{s}^{a}+\gamma \sum _{{s}^{\prime}\in S}{T}_{s{s}^{\prime}}^{a}{V}^{{\pi}^{*}}({s}^{\prime}))$$  (5) 
$${Q}^{{\pi}^{*}}(s,a)={R}_{s}^{a}+\gamma \sum _{{s}^{\prime}\in S}{T}_{s{s}^{\prime}}^{a}\underset{{a}^{\prime}\in A}{\mathrm{max}}{Q}^{{\pi}^{*}}({s}^{\prime},{a}^{\prime})$$  (6) 
The setting could potentially be directly solved employing simple linear algebra: ${V}^{{\pi}^{*}}={(I\gamma T)}^{1}R$; however, it works exclusively for finitestate MDPs and its time complexity is $O({n}^{2.4})$ (Coppersmith & Winograd, 1990). It is computationally infeasible for large and complex environments that are encountered in a vast majority of conditions. Therefore, it is rarely used, and most of the focus is given to algorithms that leverage sampling of the environment to approximate the optimal solution.
There are three different types of RL algorithms:
 •

•
Policybased  do not learn state value function explicitly but instead directly learn policy mapping $\pi :S\to A$. E.g. REINFORCE (Williams, 1992).

•
Actorcritic  combines learning policy (actor) and value function (critic). The actor makes choices about actions but it is updated by the feedback from the critic who then directly interacts with the environment. E.g. natural actorcritic (Peters et al., 2005).
Valuebased methods are usually the simplest and the fastest. The downside is that they do not work in continuous action spaces or when we want to learn stochastic policy. Policybased methods have better convergence qualities and can handle more complex policies and action spaces but are usually inefficient and converge to local minima. Actorcritic methods try to connect the advantages of both.
These traditional algorithms allowed for a big advancement in computer science creating artificial agents that have reached humanlevel performance in multiple games (e.g. backgammon (Tesauro, 1995)) without any prior knowledge. They work flawlessly for smaller state and action spaces where state value function (2) and/or stateaction value function (3) can be expressed by the lookup table. A tabular representation of these functions, however, is often not viable due to too big or continuous domains.
2.2 Deep reinforcement learning
To allow RL to scale to more involved situations, one can propose using function approximators (such as linear regression, neural networks, or Bayesian methods) to approximate stateaction value function $Q$ or policy $\pi $. Unfortunately, recursive updates of the functions (2) (3) and lack of independent and identically distributed variables (future states/rewards highly depends on current state and action) in the RL setting break most of the assumptions expected by standard machine learning methods. Therefore, trials of plugging nonlinear function approximators into the existing algorithms in the place of a tabular representation were resulting in complete divergence and failure.
Because it limited scalability of RL to many of the realworld problems, considerable amount of work has been done to solve this obstacle. Recently, Mnih et al. (2015) introduced Deep QNetwork (DQN), an RL algorithm that for the first time was capable of leveraging blackbox properties of deep learning. They bypassed the dilemma of recursive updates and lack of i.i.d. by proposing two adjustments to the standard Qlearning algorithm (Watkins & Dayan, 1992).
Replay buffer  rather than instantly learning from sampled data that is highly correlated, algorithm stores $N$ most recent tuples $({s}_{t},{a}_{t},{r}_{t+1},{s}_{t+1})$ in a replay buffer $D$. When updating the value function $Q$, it uses a random minibatch of samples from $D$ to estimate gradients. It reduces the correlation between samples by breaking their ordering.
Target network  instead of updating $Q$ function directly with itself, the algorithm maintains two distinct networks: the online network $Q$ and the target network $\widehat{Q}$. $\widehat{Q}$ is simply a fixed snapshot of $Q$ taken every $C$ updates. The agent determines actions in a conventional manner accordingly to the network $Q$, but $Q$ is updated using a revised Bellman equation:
$$Q(s,a)={R}_{s}^{a}+\gamma \sum _{{s}^{\prime}\in S}{P}_{s{s}^{\prime}}^{a}\sum _{{a}^{\prime}\in A}\pi ({a}^{\prime}{s}^{\prime})\widehat{Q}({s}^{\prime},{a}^{\prime})$$  (7) 
It stabilises the whole learning process and avoids exploding gradients by partially eliminating recursiveness in network updates.
DQN achieved superhuman level performance on Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare, Naddaf, Veness & Bowling, 2013) with raw pixels as input alone. This contribution opened an enormous amount of opportunities for RL. A lot of new algorithms and DQN improvements followed.
Schaul et al. (2015) enhanced DQN slightly increasing its data efficiency by applying prioritised experience replay to more frequently replay more informative samples. Van Hasselt et al. (2016) devised Double DQN that extends DQN to the double Qlearning method (Van Hasselt, 2010) that addresses an overestimation bias of the standard Qlearning. Wang et al. (2015) introduced a new duelling network architecture specifically for the valuebased RL that outperformed conventional supervised machine learning architectures. Bellemare et al. (2017) produced DQNbased algorithm that, alternatively to learning scalar stateaction value function $Q$, learns a categorical distribution of the future returns $Z$ whose expectation is $Q$ proving both theoretically and experimentally that this procedure improves the original DQN algorithm. Hessel et al. (2017) then consolidated all improvements enumerated in this paragraph (and a few smaller ones) into an algorithm called Rainbow DQN. It is the current RL stateoftheart for discrete action spaces.
However, DQN, as a deep variant of Qlearning (valuebased method), does not work in case of continuous actions space. Hence, it was soon merged with deterministic policy gradient methods (Silver et al., 2014) to create the Deep Deterministic Policy Gradient (DDPG) actorcritic algorithm (Lillicrap et al., 2015) solving this issue. Recently, following the success of the Rainbow DQN, multiple incremental refinements suitable to the DDPG were consolidated forming Distributed Distributional Deep Deterministic Policy Gradient (D4PG) algorithm (BarthMaron et al., 2018). D4PG is the current stateoftheart algorithm for complicated, continuous action space settings.
Although, the introduction of deep neural networks for the first time enabled RL algorithms to solve extremely complex problems and surpass humans at many levels, all of them suffer from tremendous data inefficiency. Rainbow DQN requires around 15 million frames of interaction with the real environment to match median human performance. It corresponds to over 8 days of constant play at the regular rate of 20 frames per second. It requires a total of full 200M frames to reach its peak performance (over three full months). In contrast, median human performance is defined as a score achieved by a person after merely 15 minutes of training beforehand.
As explained in the introduction, this is not an enormous problem when training agent to master board or video games as computational power is relatively cheap nowadays. Three months of 20 frames per second can be shortened to 1 hour of $\sim 45000$ frames per second given powerful enough infrastructure. However, it makes deep RL inapplicable to any other problem where obtaining samples of experience comes with potential additional costs like losing unsatisfied customers or causing an accident.
2.3 Generative adversarial networks
Recently, Goodfellow et al. (2014) introduced generative adversarial networks (GANs). They became a successful and widespread tool for data generation, profoundly overperforming previously used methods. Contrary to standard approaches that primarily focused on minimising L1 (8) or L2 (9) loss between a generated output and a real output on the individual level, GANs make it possible to work on the data distribution level minimising a difference between a generated data distribution and a real data distribution instead.
$$L1=\sum _{i=1}^{n}{x}_{generated}{x}_{true}$$  (8) 
$$L2=\sum _{i=1}^{n}{({x}_{generated}{x}_{true})}^{2}$$  (9) 
GANs work by defining two separate networks. The generator $G:Z\to X$ that maps a noise vector $z\in Z$ coming from the noise distribution ${p}_{z}$ onto the data space $X$, and the discriminator $D:X\to [0,1]$ that maps an input from the data space $X$ to the probability of the input coming from the real data distribution ${p}_{data}$. Both networks play a twoplayer minimax game with the objective:
$$\underset{G}{\mathrm{min}}\underset{D}{\mathrm{max}}\underset{x\sim {p}_{data}}{\mathbb{E}}[\mathrm{log}(D(x))]+\underset{z\sim {p}_{z}}{\mathbb{E}}[\mathrm{log}(1D(G(z)))]$$  (10) 
As theoretically proven in the original study, this minimax game is equivalent to minimising the JensenShanon (JS) divergence between ${p}_{data}$ and ${p}_{g}$:
$$JS({p}_{data}{p}_{g})\propto \underset{x\sim {p}_{data}}{\mathbb{E}}[\mathrm{log}\frac{{p}_{data}(x)}{{p}_{data}(x)+{p}_{g}(x)}]+\underset{x\sim {p}_{g}}{\mathbb{E}}[\mathrm{log}\frac{{p}_{g}(x)}{{p}_{data}(x)+{p}_{g}(x)}]$$  (11) 
and thus, modelling ${p}_{g}$ to as closely resemble ${p}_{data}$ as possible.
Unfortunately, GANs are also known for their lack of stability in training, often causing situations where one of the networks starts to completely overwhelm the other. This results in diminishing gradients for both of the networks. They also tend to ignore certain spectrums of the distribution (mode collapse problem). Therefore, numerous researchers tried to stabilise the original GAN algorithm. One of the most successful improvements was replacing JS divergence with the EarthMover distance (1st Wasserstein metric) creating the Wasserstein GAN (Arjovsky et al., 2017):
$$\underset{G}{\mathrm{min}}\underset{D}{\mathrm{max}}\underset{x\sim {p}_{data}}{\mathbb{E}}[D(x)]+\underset{z\sim {p}_{z}}{\mathbb{E}}[D(G(z))]$$  (12) 
Now, the discriminator (in the paper called critic) is no longer trying to predict if the data comes from the real or fake distribution. It is rather providing an actual realvalued distance (as measured by the EarthMover metric) between the data generated by the generator and the data coming from the real distribution. Hence, the goal is not longer to balance both networks but to make sure that the critic can converge to the real distance before letting the generator to improve. It has been mathematically proven in the study that the Wasserstein GAN always converges given that the critic is infinitely more powerful than the generator.
The issue with the Wasserstein GAN is that it constructs its minmax value function using the KantorovichRubinstein duality (Villani, 2008). Therefore, to be theoretically and practically sound, the critic needs to represent values coming from the set of 1Lipschitz functions:
$$f\text{is 1Lipschitz}\iff f({x}_{1})f({x}_{2})\le {x}_{1}{x}_{2}$$  (13) 
The original Wasserstein GAN enforced this limitation by clipping the weights of the critic network to space $[c;c]$. Setting the $c$ hyperparameter is, however, a nontrivial task that introduced new instability problems.
Fortunately, Gulrajani et al. (2017) circumvented this concern by appending a gradient penalty to the final minimax objective instead of clipping weights of the critic:
$$\underset{G}{\mathrm{min}}\underset{D}{\mathrm{max}}\underset{x\sim {p}_{data}}{\mathbb{E}}[D(x)]+\underset{z\sim {p}_{z}}{\mathbb{E}}[D(G(z))]+\underset{\xaf}{\lambda \underset{\widehat{x}\sim N(0,1)}{\mathbb{E}}[{({{\nabla}_{\widehat{x}}D(\widehat{x})}_{2}1)}^{2}]}$$  (14) 
where $\lambda $ is a gradient penalty factor hyperparameter. $\lambda =10$ in the original paper and $\lambda =0$ recovers original Wassersetein GAN objective.
Unfortunately, the choice between the original GAN and the Wasserstein GAN with gradient penalty (WGANGP) is not that trivial. Lucic et al. (2018) showed that WGANGP is not necessarily outperforming standard GAN given suitable hyperparameters configuration. WGANGP also takes much longer to train because the critic, for every step of the generator, has to converge to the appropriate value of the Wasserstein distance fully. However, WGANGP’s hyperparameters are much simpler to optimise. The choice between both usually comes down to the tradeoff between computer power consumed by training and researcher/developer time spent on tuning the hyperparameters.
Furthermore, much research has been devoted to conditional GANs where the generator, rather than taking only a random noise $z$ as an input, takes another value $y$ based on which the final output is conditioned (Mirza & Osindero, 2014). This enables us to not only generate arbitrary samples that follow the data distribution but also to have an influence on what precise spectrum of the distribution to obtain. This is particularly valuable when fitting GANs to the RL setting where the reward and next state tuple $({r}_{t+1},{s}_{t+1})$ is conditioned on the current state and chosen action pair $({s}_{t},{a}_{t})$. The stateoftheart in conditional GANs, especially in the area of computer vision, is PIX2PIX GAN that exhibited extraordinary results in multiple imagetoimage translation problems (Isola et al., 2017).
3 Related work
3.1 Modelbased reinforcement learning
As briefly discussed in section 1, one of the promises of the modelbased reinforcement learning is to drastically improve the sample efficiency of reinforcement learning. The agent with access to the transition matrix $T$ and reward function $R$ (section 2.1) could internally reason about potential scenarios and outcomes without performing risky exploration in the real environment. Unfortunately, the specific characteristics of the RL environment are usually unknown, and hence these methods cannot be directly applied.
3.1.1 Learning the model  imagination
One plausible solution to that difficulty could be learning the model of the environment instead. The idea of learning the model of the environment from the sampled experience when its dynamics are not fully known is one of the most important concepts adapted in this thesis. However, this notion is not new. Current investigation in this area concentrates on application of variational autoencoders (VAEs) (Kingma & Welling, 2013), recurrent neural networks (RNNs) (Medsker & Jain, 1999), and/or Bayesian methods. Lenz et al. (2015) introduced a novel robot control algorithm leveraging RNN that predicted the position of robot’s parts. Bellemare, Veness & Bowling (2013) factors the state space to decompose model learning into smaller, more manageable subproblems and applies Bayesian inference to predict future states.
A substantial breakthrough in learning the model of the environment in the RL setting and the algorithm that is most widely applicable was proposed by Oh et al. (2015). The paper presented two novel EncodingTransformationDecoding architectures to learn the transition probabilities function $T$. They first encode highdimensional state ${s}_{t}$ using convolutional network (LeCun et al., 1999), then the transformation conditioned on ${a}_{t}$ is performed to convert a highlevel encoding of the of the current state to a highlevel encoding of the next state, and finally decoding using deconvolutional network (Zeiler et al., 2010) is executed to decode highlevel nextstate features into the full representation of the next state ${s}_{t+1}$. The first architecture employs a simple feedforward method and takes a fixed history of states as an input. The other takes advantage of LSTM cells (Hochreiter & Schmidhuber, 1997) to capture the most relevant features from the past.
This work was later extended through a straightforward modification to support modelling of reward function $R$, and thus, to able to learn the full model of the environment (Leibfried et al., 2016). It was also incrementally improved by Chiappa et al. (2017) by the alteration of the original architecture and exploration of a few novel ideas.
Nevertheless, these model generation techniques have two major areas for improvement:

•
They do not utilise Markov property (1) fully, treating the model learning as a highly sequential and nonstationary problem instead of transforming it into much easier, stationary scenario. It makes capturing all intrinsics of the environment much harder to learn.

•
They utilise L1 or L2 objective to train the model and predict future states. L1/L2 loss penalises each difference between prediction and ground truth uniformly. Thus, it struggles to prioritise small but significant features over large but meaningless (e.g. slight ball location change over reduced saturation of a black background in Pong Atari game) and often produces blurry results. This is a wellknown problem when applying standard classification/regression deep learning models for generative tasks.
3.1.2 Applications of imagination
Historically, there has been a significant amount of work devoted to modelbased reinforcement learning methods that took advantage of known environment dynamics. Firstly, as already mentioned in section 2.1, having that knowledge allows to directly solve the RL setting using simple linear algebra. If state/action is too intricate for that, we can apply traditional planning techniques. Additionally, Monte Carlo tree search (Browne et al., 2012) can be deployed for more reliable evaluation of the future rewards. Several modelbased techniques like DQNcognate methods coupled with prior knowledge about the intrinsic model of the environment have already yielded outstanding effects (Silver et al., 2016, 2017).
Sadly, using these techniques directly on the imagination (approximated real environment) has been shown to perform very poorly (Talvitie, 2014, 2015). It is because approximations, by their nature, introduce a certain level of error. This error quickly compounds over time when performing long environment rollouts into the future. Utilising this approximation to estimate policy using planninglike methods is directly translating this compounded error into estimated policy causing it to be, in the best case, suboptimal.
One of the first algorithms that managed to do that successfully was DynaQ (Sutton, 1990). It used a lookup table for imagination modelling. The imagination was later used to simulate and replace the real environment. Unfortunately, it did not scale to more complex problems due to the use of a finite lookup table. The highlevel structure of the algorithm proposed in this thesis is inspired by the traditional DynaQ and can be partially viewed as its expansion to bigger and/or continuous state spaces.
Most recently, Racanière et al. (2017) successfully employed model learning techniques described in section 3.1.1 by combining them with the standard modelfree methods in the decision making process. Their results showed improvements in the effective exploration and in the environments where longterm planning is crucial, even, when the imagination of the agent was far from perfect. Nevertheless, it did not significantly reduce the size of the system exposure necessary for training a wellperforming agent.
Recent work that most closely resembles this thesis was done by Venkatraman et al. (2016) and Gu et al. (2016). They also try to improve sample efficiency of RL by using DynaQ inspired algorithms. However, both of the studies adopt simple, often linear, methods to train the imagination. Hence, they can only be applied to specific domains where such methods are successful. Additionally, their data efficiency is not specifically better than the current modelfree stateoftheart Rainbow DQN (Hessel et al., 2017).
3.2 GANs in reinforcement learning
GANs and RL are usually treated as separate fields of machine learning research by the community. There was not much attention to combining them or employing advances from one domain to improve the other. Recent work by Pfau & Vinyals (2016) presented a deep connection between GANs and actorcritic RL trying to encourage both communities to learn from each other.
A few studies to utilise GAN architecture to devise novel RL algorithms followed. Doan et al. (2018) introduced GAN Qlearning, a modelfree distributional alternative to the DQN algorithm, by using generative adversarial architecture to express Bellman update (3) implicitly. Ho & Ermon (2016) obtained significant performance gains in imitation learning (Schaal et al., 2003) by applying a novel Generative Adversarial Imitation Learning algorithm. Henderson et al. (2018) proposed an innovative OptionGAN architecture for inverse reinforcement learning (Ng et al., 2000) improving results in oneshot transfer.
This thesis proposes the use of GANs to learn the dynamics of the environment and form the agent’s imagination. Similar approach was presented by Xiao & Kesineni (2016) and Azizzadenesheli et al. (2018). They both examined learning the model of the environment for Atari games to, similarly to the AlphaGo (Silver et al., 2017), apply a combination of DQN with the Monte Carlo tree search to effectively discover the optimal policy. Both studies reported negative results due to the too short MCTS rollouts as proven in the former. Azizzadenesheli et al. (2018), however, was able to remarkably efficiently learn a very accurate representation of model dynamics using generative adversarial approach.
4 Imaginative framework for data efficiency
Building upon previous research, this thesis introduces Generative Adversarial Imaginative Reinforcement Learning (GAIRL) algorithm to answer questions posed in section 1 and to improve overall data efficiency of deep RL. As briefly mentioned in section 3, study by Sutton (1990) inspired the main structure of the novel algorithm. This structure is a focus of this section, without going much into details of the generative adversarial imagination. Transforming the imaginative framework specifically into the GAIRL algorithm is a topic of the next section.
The imaginative core underpinning GAIRL is divided into two separate modules: the modelfree module (MFM) and the imagination module (IM). It also makes use of the concept of memory $M$. The memory is simply an array storing previous realenvironment experience, similarly to the replay buffer in the DQNcognate algorithms that was described in section 2.2.
These modules are utilised across three distinct training phases. First is the modelfree phase (MFP) that only makes use of the MFM, followed by the imagination training phase (ITP) where solely the IM is operated, finalised by the imaginationbased phase (IBP) that combines both of the modules.
4.1 Modelfree module
The MFM consists of a standard modelfree reinforcement learning algorithm; in this study, it is the stateoftheart Rainbow DQN. It is the core decisionmaking module that models policy mapping $\pi :S\to A$ of the agent. The advantage of the imaginative framework is that it can leverage any other existing modelfree algorithm if it is more suited for a given task (e.g. D4PG in environments with continuous action space) or if a modelfree algorithm that overperforms Rainbow DQN is introduced in the future.
4.2 Imagination module
The IM is a crucial part of the whole framework. It is a trainable system that can simulate the behaviour of the environment dynamics, i.e. accurately approximate the transition probabilities function $T$ and the reward function $R$, similarly to the approach presented by Leibfried et al. (2016). Similarly to the MFM, any generative model can be employed in place of the IM. Nonetheless, this thesis focuses on a specially designed IM to optimise its data efficiency. In depth design of the IM to create the GAIRL algorithm can be found in section 5.
4.3 Modelfree phase
The MFP mostly follows the standard RL procedure taking advantage of the modelfree RL algorithm employed within the MFM. Similarly to the standard RL, it is based entirely on the real environment. The only difference is that, in addition to the use of experience samples returned by the environment to improve the policy $\pi $ encapsulated in the MFM, it also stores these samples within the agent’s memory $M$.
MFP is the only one out of three phases that requires the real environment to sample experience. Therefore, to limit the amount of the real experience that is required, it is also recommended to keep it as short as possible.
4.4 Imagination training phase
Imagination training phase is focused on using transitions samples (${s}_{t},{a}_{t},{r}_{t+1},{s}_{t+1}$) stored in the memory $M$ to train the IM to accurately follow the dynamics of the real environment (approximate functions $T$ and $R$) in a purely supervised learning manner.
Length of the ITP does not affect the real environment at all as it exploits past, memorised, experience instead. Hence, it can, and it should to produce more optimal results, run until the IM fully converges to the representation of the data that is stored in the memory.
4.5 Imagination based phase
Having fully trained imagination, we can start leveraging it. The IBP is focused on improving the agent’s policy by letting MFM train and rollout experimental scenarios on the IM instead of making potentially expensive trial and error in the real world. In fact, the MFM is not even aware of the fact that its learning process has been moved onto the ’fake’ environment instead of the real one.
IBP, similarly to the ITP, should last until the agent’s policy fully converges to the imaginative environment to extract as much signal, from experience gathered so far, as possible. Nevertheless, in practice, experiments focused on performing only three times more steps in the imagination than in the real environment, even, when the policy did not always fully converged during the imagination based phase.
4.6 Summary
The whole premise of the framework is to extract meaningful signal from the obtained experience as efficiently as possible. Imagination module serves as a way to generalise possible distribution of the world to create a safe, artificial, imaginative environment where the agent can learn and experiment without any risks associated with the actions in the real world.
The three phases work in a loop following Algorithm 4.6. The loop part is crucial. In the first MFP, the agent usually performs completely random and experimental actions and is very likely to not reach more advanced environment states such as the next level in a video game. Going through the ITP and the IBP can allow it to master the first level of the game. However, it needs to go back to the real environment to explore newly reachable states to be able to imagine them in the second iteration accurately. This process should continue until the agent’s policy fully converges to the given environment.
Convenient characteristic of the whole framework is that it can work for any modelfree RL algorithm in place of the MFM and any generative module in place of the IM. Nevertheless, only the framework that specifically leverages GANs and Markov property for the data efficient IM is defined as GAIRL as explained in the next section.
{algorithmic}[1]
\ProcedureGAIRL$MFM,IM,env$
\StateInitialise $MFM$
\StateInitialise $IM$
\StateCreate and initialise $M$
\Whiletrue
\StateTrain $MFM$ on $env$ while collecting experience samples in $M$ (MFP)
\If$MFM$ converged on $env$
\Return$MFM$ agent
\EndIf\StateTrain $IM$ on the data from $M$ (ITP)
\StateTrain $MFM$ on $IM$ (IBP)
\EndWhile\EndProcedure
5 Imagination module structure for GAIRL
The framework described in section 4 is based on the assumption that the IM can learn the dynamics of the real environment accurately, and efficiently enough. As discussed in section 3.1.1, there already exist some studies on this topic. Most prominently Oh et al. (2015) used variational autoencoders in combination with recurrent neural networks to create a model that can predict the next state in Atari games conditioned on the current state and the chosen action. Its results were extremely accurate; however, it did not focus on sample efficiency.
Contrary to the previous studies, the IM of the GAIRL algorithm takes into account Markov property simplifying the whole setting to the straightforward supervised learning problem with an entirely stationary mapping $S\times A\to \mathbb{R}\times S$. It is, therefore, theoretically, much more data efficient, especially compared to the standard modelfree policy learning that tries to learn a behaviour maximising expected cumulative reward, i.e. the extremely nonstationary sum of all future rewards.
Although any supervised learning model that can fit the abovementioned description could be used in the framework, certain algorithms are superior to others. As described in section 2.3, standard deep learning models based on the L1 or L2 loss do not perform well in generative tasks. In theory, the best performing architecture to approximate different highdimensional generative distributions are generative adversarial networks (Goodfellow et al., 2014). GANs should add another advantage over popular variational autoencoder approach. As mentioned in section 3.2, use of the PIX2PIX GAN (Isola et al., 2017) for this setting already yielded remarkably good and sample efficient results in the very recent study (Azizzadenesheli et al., 2018).
Another essential characteristic of GANs is that they take random noise as an input (in addition to the conditional input in case of standard generative models). This quality provides an additional advantage over previously used methods in highly stochastic environments. It should allow GAN, if needed, to produce a set of diverse stochastic outputs for the same conditional input, instead of just a deterministic mean of possible outputs.
Although single GAN seems like a perfect match for the IM, in practice, it is easier to construct the IM using two separate deep learning models. One predicting the next state (modelling transition probabilities function $T$), another predicting the expected reward (modelling reward function $R$). In the proposed framework, the next state generation is handled by a GAN because state space is often highly dimensional structure perfectly suited for that architecture. Reward, however, is always represented by a single, realvalued, scalar. Therefore, it can be simplified to a simple regression problem. Hence, GAIRL employs a traditional L1/L2 loss based regression model for the reward generation task. The full internal structure of the IM module during the ITP can be seen in Figure 7. Figure 8 shows how trained imagination is then used in the IBP.
The final crucial step towards a successful use of the IM in the GAIRL setting was deciding on starting states for the IM rollouts. Real environment simply starts in a particular state. IM, on the other hand, is just a combination of supervised learning models, it does not have any inherently associated internal state. States are merely inputs to the machine learning system. Therefore, in the IBP, whenever there is a need to start a new episode, a random state is sampled from the memory. Just then, having these ground truth initial state, the IM is used to simulate possible scenarios into the future in the standard manner.
6 Experimental setup
6.1 Environments
To asses the capabilities of the algorithms, OpenAI Gym (Brockman et al., 2016) was employed. It provides traditional and most popular reinforcement learning environments in an easy to use manner. For the past four years, the most important reinforcement learning benchmarks were Atari games from the Atari Learning Environment (Bellemare, Naddaf, Veness & Bowling, 2013) (also provided by the OpenAI gym). Unfortunately, RL algorithms require very extensive hardware to converge to the optimal policies for these games (up to 100GB of RAM per single run of the algorithm).
Therefore, due to high time constraints and limited resources, this study employs slightly simpler benchmarks from the classic control family of the problems. It mainly focuses on the MountainCar (Moore, 1990), and Acrobot (Sutton, 1996) environments. Although classic control environments are simpler and operate on lower dimensions, they were the standard benchmark in the reinforcement learning community for decades before Atari games were adopted. Nevertheless, in addition to showing the proof of concept of the GAIRL framework on the MountainCar and Acrobot, the plan is to continue the work to analyse the framework on currently adopted test environments.
6.1.1 MountainCar
MountainCar is based on a simple, lowdimensional setting. The agent controls a car that must drive up the steep slope. The car is not able to climb the hill directly. The agent has to learn to leverage the fact that it is situated in a valley and can use potential energy of the opposite slopes.
The state space is represented by two continuous variables: horizontal position of the car ${x}_{t}$, and velocity of the car across the car’s axis ${v}_{t}$ (${s}_{t}=({x}_{t},{v}_{t})\in {\mathbb{R}}^{2}$). Action space is defined by a single choice: drive left, or drive right (${a}_{t}\in \{1,1\}$). The agent gets a reward $r=1$ for every time step until the episode is terminated (car drives up the right hill). An optimal policy should minimise the time it takes to reach the top of the hill, and thus maximise the cumulative reward obtained during the episode.
To solve the MountainCar environment, the algorithm has to be capable of handling continuous state space. Additionally, what makes it harder to learn is a very sparse reward signal. The agent has no information about the goal until it is reached. A meaningful signal occurs once per from 200 (closetooptimal policy) to over 5000 (random policy) actions.
6.1.2 Acrobot
Acrobot, although still fairly lowdimensional, is harder and much more difficult task. The agent controls a twolink, underactuated robot arm. The first (upper) joint, attached to the background, is out of control of the agent. The only controllable part is the torque of the lower joint of the robot. The goal is to balance the whole arm, so the tip of the second link swings above the episode termination line.
Six continuous variables represent the state space: sinus and cosine of the angle ${\alpha}_{t}$ between the first link and a horizontal line; sinus and cosine of the angle ${\beta}_{t}$ between the first and the second link; an angular velocity ${\omega}_{t}$ of the first joint; and an angular velocity ${\theta}_{t}$ of the second joint.
$${s}_{t}=(\mathrm{sin}{\alpha}_{t},\mathrm{cos}{\alpha}_{t},\mathrm{sin}{\beta}_{t},\mathrm{cos}{\beta}_{t},{\omega}_{t},{\theta}_{t})\in {[1;1]}^{4}\times {R}^{2}$$ 
Action space is again defined by a single choice, although slightly modified: apply positive torque, no torque, or negative torque to the second joint (${a}_{t}\in \{1,0,1\}$). Similarly to the MountainCar, the agent gets a reward $r=1$ for every time step until the episode finishes (tip of the second link swings above the termination line) to encourage policies that take the least amount of time to complete the episode.
6.2 Implementation
All algorithms were implemented using the Tensorflow machine learning library (Abadi et al., 2015) with Python as a frontend programming language. Results and performance of the algorithms were captured and visualised using TensorBoard, a visualisation tool that is a part of the Tensorflow framework.
6.2.1 Modelfree reinforcement learning
Implementation started with the standard Deep QNetwork code following the original paper (Mnih et al., 2015). The initial structure consisted of 2 hidden layers with 24 nodes each to accustom for simpler environments. Although the most popular activation function for hidden layers of deep neural networks is rectifier linear unit (ReLU) ($f(x)=\mathrm{max}(0,x)$) introduced by Nair & Hinton (2010), from my experience it fairly often causes ’dead neurons’ problem, i.e. input to the ReLU is negative (thus output is a constant $0$) causing lack of gradient flowing through the node during the backpropagation. Therefore, from the beginning, DQN was employing leaky rectifier linear units ($f(x)=\mathrm{max}(\alpha x,x)$) with default parameter ${\alpha}_{lrelu}=0.2$ (Maas et al., 2013). Starting configuration was using random minibatches of 32 experience samples from the replay buffer and Adam optimisation technique (Kingma & Ba, 2014) to derive an optimal set of network weights.
It was then debugged and hypertuned until it was able to solve both, previously mentioned, environments easily. Most interesting changes performed during the hypertuning regarded the batch size and the optimisation technique. Initially, the agent was not able to even reach closetooptimal policies. A single change from Adam optimiser to the standard stochastic gradient descent without any additional acceleration (SGD) was enough to fix the issue. Deep reinforcement learning is known to suffer from high instability. Momentumbased optimisation may often exacerbate this problem. Additionally, SGD has been recently shown to have better generalisation properties (Wilson et al., 2017), what may be another reason for its superiority in this setting. Following this change, the agent became able to reach optimallike behaviour; however, it was quickly forgetting what it has learned and collapsing back to suboptimal policies. The solution was to increase the batch size, from 32 to 256 strongly. A batch made out of only 32 samples was often not diverse enough to represent the actual signal coming from the environment, thus again intensifying instabilities of deep RL.
Afterwards, the stateoftheart Rainbow DQN was written inheriting the DQN structure and following improvements described by Hessel et al. (2017). It followed the same debugging and hypertuning procedure. The final Rainbow DQN hyperparameters are presented in Table 1. The same Rainbow DQN was then used both as a baseline and within the modelfree module of the GAIRL framework.
Hyperparameter  Value 

Hidden layers  $[24,24]$ 
Hidden layers activation  Leaky ReLU 
Leakiness parameter (${\alpha}_{lrelu}$)  $0.2$ 
Dropout probability  $0$ 
Final layer activation  Linear 
Optimiser  Gradient descent 
Learning rate (${\alpha}_{lr}$)  $5\times {10}^{3}$ 
Gradient clipping  $1$ 
Discount factor ($\gamma $)  $0.99$ 
Exploration strategy  $\u03f5\text{greedy}$ 
Exploration $\u03f5$ decay  $1\to 0.05\text{(linear)}$ 
Exploration decay start  $1000$ steps 
Exploration decay length  $10000$ steps 
Replay buffer size  $10000$ 
Replay batch size  $256$ 
Prioritisation $\u03f5$  $1\times {10}^{5}$ 
Prioritisation $\alpha $  $0.6$ 
Prioritisation $\beta $ decay  $0.4\to 1\text{(linear)}$ 
Prioritisation decay length  $50000$ 
Noisy networks ${\delta}_{0}$  $0.5$ 
Multistep returns $n$  $3$ 
Online network update frequency  $4$ 
Target network update frequency  $500$ 
6.2.2 Generative models
Following the successful implementation, debugging, and hypertuning reinforcement learning agents, implementation of the second group of necessary algorithms – generative models – has started. All of the neural networks for the generative models were also initially configured to use Leaky ReLU activation for the hidden layers and Adam optimisation for training.
In the first step, I implemented standard GAN based on the original work by Goodfellow et al. (2014). Similarly to the DQN, it was then debugged and hypertuned; however, this time using standard MNIST dataset (LeCun et al., 1999). At first, both generator and discriminator seemed like they entirely lack gradient to progress, yet none of the well known GAN issues occurred. Interestingly, the problem did not lay purely in standard hyperparameters of the networks but in the weights initialisation. Originally, GAN’s weights were initialised according to the same distribution as weights of RL algorithms. In DQN, weights have to be initialised to values with a mean $\mu =0$ and a very low standard distribution to avoid divergence and exploding gradients that can be caused by recursive updates. For GANs, however, these values were too low to produce any meaningful signal. Increasing the standard deviation of initial weights distribution solved the difficulty. Unfortunately, networks imbalance problem that was described in detail in section 2.3 followed. Discriminator started to completely overwhelm generator causing the gradient of both networks to vanish. The solution to that was decreasing the number of discriminator training steps for each generator training step from $k=5$ to $k=1$.
After the original GAN was implemented, both Wasserstein GAN (WGAN) and Wasserstein GAN with gradient penalty (WGANGP) were built on top. Because tuning weight clipping constant $c$ in WGAN can be extremely hard and mundane, after making sure there are no bugs in the code, focus moved to the WGANGP implementation, without fully hypertuning the standard WGAN. Sadly, the WGANGP did not perfectly work on the dataset with the same hyperparameters as the standard GAN. The reason behind it was that, unlike the discriminator in the standard GAN, the critic in the WGANGP has to be much more powerful than the generator so it can fully converge to the real value of the Wasserstein distance between the real and the generated data. Slightly increasing the network size of the critic and the number of critic steps per single generator step from $k=1$ to $k=10$ caused the algorithm to perform much better. Nevertheless, it was still suffering from high variance and problematic convergence. As hypothesised and proved by the next runs of the model, this was caused by too high learning rate in both of the networks.
Furthermore, after implementation of the GAN, WGAN, and the WGANGP based on the original papers, they were expanded to allow for a conditional generation as proposed by Mirza & Osindero (2014). Conditional versions of these algorithms did not need any additional hypertuning to learn probabilistic distribution underlying the MNIST dataset adequately.
As described in section 5, an L1/L2 loss based model is also necessary for the expected reward generation. Therefore, in addition to GANs, a standard multilayer perceptron (MLP), optimising for the mean absolute error (L1 loss), has been implemented. L1 objective, instead of the most popular mean squared error (L2), has been chosen because it had been shown to work better for generative tasks (Zhao et al., 2017). The MLP was also initially debugged and hypertuned to fit the MNIST dataset. The comparison between implemented WGANGP and MLP on the MNIST dataset can be seen in figure Figure 2. It is a great example of the superiority of GANs over traditional models in highdimensional generative tasks.
What is also crucial, the property that all generative models followed equally, both for the MNIST dataset and as a part of the imaginative framework, was normalisation. Namely, all values from the data, both inputs and outputs, were scaled to fit the range $[0;1]$. It further optimised their performance but also made it easier to interpret results. The fact that the output space is a range $[0;1]$ is leveraged by employed evaluation metrics as described in section 6.3.
6.2.3 GAIRL
Once generative algorithms and the modelfree RL agent had been implemented, putting together the GAIRL agent begun. As explained in sections 4 and 5, the modelfree module consists of the RL agent, and two generative models take the place of the imagination.
In place of the MFM, the final version of Rainbow DQN described in section 6.2.1 is used. Precisely the same network structure and hyperparameters were employed for two reasons: Firstly, to entirely make sure that experimental results show actual merit of the GAIRL method, and are not merely caused by varying hyperparameters between the algorithms. Secondly, to test the promise of GAIRL to work as a universal sample efficiency enhancement in a very modular, plug and play manner, that does not require any special tuning of the RL algorithms employed within the MFM.
Although one of the central premises of the research presented in this thesis is the advantage of GANs for the next state generation part of the imagination module, it was plausible that on chosen, less dimensional benchmarks they may actually perform worse than the standard L1/L2 loss approaches due to their higher level of complexity. Therefore, two different versions of the IM were implemented: an MLPbased imagination and a WGANGPbased imagination. The MLPbased version uses an MLP model for both next state and reward generation. Whereas WGANGPbased sticks to the standard GAIRL premise of using a GAN for the next state and an MLP for the reward. Also, this decision rendered a good comparison between Generative Adversarial Imaginative Reinforcement Learning and potentially simpler version of the framework. For both versions, the output of the MLP used for the reward generation is rounded to the closest integer (to accustom for discrete rewards in the experimental environments).
Hyperparameter  Value 
Hidden layers  $[512,512]$ 
Hidden layers activation  Leaky ReLU 
Leakiness parameter (${\alpha}_{lrelu}$)  $0.2$ 
Dropout probability  $0$ 
Final layer activation [generator]  Tanh 
Final layer activation [critic]  Linear 
Penalty coefficient ($\lambda $)  10 
Critic steps per one generator step ($k$)  10 
Optimiser  Adam 
Learning rate (${\alpha}_{lr}$)  $2\times {10}^{4}$ 
First Adam decay rate (${\beta}_{1}$)  $0.5$ 
Second Adam decay rate (${\beta}_{2}$)  $0.9$ 


As described in section 2.3, an original GAN could be more computationally efficient than the WGANGP. However, a hard deadline on the thesis moved the focus on minimising the time necessary to tune hyperparameters, instead of minimising the computational efficiency.
What is more, both environments presented in this study have deterministic dynamics, and so generative models do not require any inherent stochasticity. Therefore the random noise input to the WGANGP was omitted. All of the hyperparameters, both for WGANGP and MLPs, had to also be customised to the new setting. Tables 2 and 3 show final hyperparameter choices for all of the generative models used for the final evaluation.
In practice, the imaginative framework also maintains two separate memory modules: one representing a training set, the other a test set. It is vital to analyse the capabilities of the imagination module thoroughly. The memory used for experiments consisted of $200000$ most recent samples from the environment, 80% of which was used for training and 20% was separated purely for assessing the IM. A critical characteristic of the memory was that it artificially increased the number of experience samples that result in terminal states by oversampling (Ling & Li, 1998). It was necessary due to a massive imbalance of terminal/nonterminal states in almost every RL setting. To do so, GAIRL keeps track of the mean length of episodes ${\mu}_{ep}$. Once the episode finishes (a transition sample with a terminal state occurs), the obtained sample is replicated $\lceil {\mu}_{ep}\rfloor $ many times.
As per algorithm 4.6, three training phases, MFP, ITP, and IBP continue in a loop starting from the modelfree phase. MFP was initially set to operate for ${\rho}_{1}=20000$ steps in the real environment, ITP for ${\rho}_{2}=20000$ stochastic gradient descent imagination training steps, and IBP for ${\rho}_{3}=60000$ steps in the imaginative environment. However, initial experiments showed that longer ITP with ${\rho}_{2}=40000$ steps better generalises on both environments and so the hyperparameter has been changed for the final version of the algorithm.
6.3 Evaluation metrics
Different metrics have been used to test different aspects of the algorithms. Precision (15) and recall (16) is employed to assess the performance of the reward generation (normalised reward in the chosen environment can be either $0$ or $1$). Precision describes a ratio of true positives (correctly generated $1$s) within all generated positives (all generated $1$s, no matter if correctly or not). Recall defines how many $1$s were correctly generated out of all true $1$s. They provide much more information about the real performance of the machine learning model than a standard accuracy metric, especially in situations of highclass imbalance that takes place in chosen environments (reward $0$ is extremely more common than $1$).
$$\text{precision}=\frac{\text{true positives}}{\text{true positives}+\text{false positives}}$$  (15) 
$$\text{recall}=\frac{\text{true positives}}{\text{true positives}+\text{false negatives}}$$  (16) 
State, on the other hand, follows fully continuous dynamics. Therefore, mean absolute error (MAE) (17) is used for stategenerating models. MAE is simply an averaged L1 loss that calculates a mean absolute difference between a generated state and a ground truth state. It has been chosen over more popular mean squared error for the same reasons L1 loss has been chosen over L2 loss (see section 6.2.2). Given that all outputs are normalised between $0$ and $1$, the value of $1\text{MAE}$ can also be referred to as accuracy.
$$\text{MAE}=\frac{1}{n}\sum _{i=1}^{n}{x}_{generated}{x}_{true}$$  (17) 
Finally, to evaluate data efficiency of algorithms (the main focus of this thesis), simply a mean reward from the $100$ most recent episodes, in regards to the number of steps performed in the real environment, is used.
7 Results
First, imagination capabilities to accurately approximate real environment dynamics are presented. Then, focus moves onto data efficiency of different variations of the imaginative framework and the stateoftheart Rainbow DQN. Furthermore, computational efficiency analysis of the proposed framework can be found in Appendix B. All experiments were performed using algorithms and parameters described in section 6.2, the same for both environments. Each of the showcased results is based on 15 independent runs of the algorithm. In all of the plots, lines represent a mean value for the runs. Opaque areas represent a standard deviation around the mean.
7.1 Imagination performance
7.1.1 Reward generation
Figure 11 shows the performance of the reward imagination submodule in terms of precision and recall in both environments. Both metrics, besides recall of Acrobot reward that requires twice as much experience, converge after $120000$ steps. For MountainCar, as it is a simple environment, the MLP can easily reach over $99\%$ recall and precision in a few GAIRL iterations. Acrobot is a bit more challenging and even converged imagination’s precision can drop below $99\%$. Although results of this magnitude are sufficiently accurate, they could potentially be improved if more time was spent on hypertuning the machine learning model.
What can be intriguing, are big jumps about every $60000$ steps in the performance of the network as the training progresses. They are caused by the the main GAIRL algorithm loop. For every GAIRL iteration, agent gathers $20000$ samples of experience from the real environment (${\rho}_{1}=20000)$ to then train the imagination for $60000$ gradient descent steps (${\rho}_{2}=60000$). Therefore, each multiple of $60000$ marks a point after which more real data enters the process helping the model to better generalise to unseen samples. I.e. in the first $60000$ steps imagination learns only based on $16000$ samples ($80\%$ of $20000$ because another $20\%$ belongs to the test memory), in the $60000120000$ period based on $32000$, in the $12000180000$ based on $48000$, and so on.
7.1.2 State generation
The effectiveness of the state imagination submodule is presented in Figure 12. Both Figure 11(b) and 11(c) show MAE for both WGANGP and MLP state generation. Figure 11(a), on the other hand, shows Wasserstein distance as approximated by the critic for WGANGP imagination only.
We can see that WGANGP performs worse than the MLP in terms of MAE. This is, however, expected behaviour. MLP optimises for the L1 loss, that is in essence MAE multiplied by a constant, directly. GANs superiority lays in the fact that they do not optimise towards minimising mean error on the individual level, but towards minimising the difference between data distributions. These plots only showcase that indeed both of the modules seem to model dynamics of the real distribution accurately. Using them for direct comparison of the models would be unfair for the WGANGP.
MLP reaches over $99.9\%$ accuracy on MountainCar and $99.5\%$ accuracy on Acrobot, even without much of hypertuning. Unlike with the reward, very high accuracy of state generation is crucial because the correctness of future states highly depends on the correctness of previous states. When making rollouts into the future using pure imagination, errors may compound. Given accuracy $a$ of the state generation model and rollout of length $\tau $, final state’s accuracy may drop, in the worst case, to ${a}^{\tau}$ (nevertheless, this is also unlikely).
Even WGANGP that does not optimise for MAE reaches good results of over $97\%$ accuracy for both environments proving its convergence properties. This is also shown by the estimated Wasserstein distance between the generator distribution and the environment distribution that reaches values lower than $0.015$ for both environments (even below $0.005$ for MountainCar).
These results show a high promise of replacing the real environment with the imagination, especially considering the fact that not much time and resources have been spent on optimising the architecture and parameters used in the experiments.
7.2 Data efficiency
Data efficiency is the main problem targeted in this study. Results in section 7.1 show that deep learning models can efficiently learn the dynamics of the real environment. The remaining question is whether these models are accurate enough to replace the real environment in the RL process.
Figure 13 presents results of both MLPbased and WGANGPbased imaginative framework in comparison to the imaginationfree stateoftheart algorithm. Both imaginative algorithms highly outperform imaginationfree Rainbow DQN being more than twice as data efficient on both environments.
GAIRL requires even as much as over three times fewer experience samples than the framework with MLPbased imagination, and over 6 times less than the current stateoftheart on the more complex Acrobot environment. Surprisingly, despite the simplicity of MountainCar setting, GAIRL also solves it most optimally.
What is also important, one of the concerns was that imaginationaided agent would be much less stable than the standard modelfree algorithm. It is indeed the case on the simple MountainCar, especially for WGANGPbased framework. Remarkably, once the complexity of the environment increases (Acrobot), it is no longer the case.
In addition, twotailed Wilcoxon signed rank tests were performed to statistically compare obtained results. There were $N=15$ nonzero difference samples in every comparison. The critical value to make sure that the results are statistically significant ($p\le 0.05$) for $N=15$ samples is ${\omega}_{c}=25$
Firstly, the comparison between the imaginationfree state of the art, and the mlpbased imaginative algorithm was made. For the MountainCar environment, Wilcoxon test produced rank sums ${T}_{{m}_{1}}^{+}=114$ and ${T}_{{m}_{1}}^{}=6$. Clearly, $$. Similarly, in the Acrobot environment rank sums ${T}_{{a}_{1}}^{+}=111$ and ${T}_{{a}_{1}}^{}=9$ were collected. Again, $$. Therefore, the null hypothesis stating that the imaginative framework does not improve sample efficiency of the state of the art can be safely rejected.
Then, the focus moved to the analysis of GAIRL benefits over the more primitive MLPbased imaginative framework. Rank sums for the MountainCar environment in this scenario are ${T}_{{m}_{2}}^{+}=82$ and ${T}_{{m}_{2}}^{}=38$. This time, ${\omega}_{{m}_{2}}>{\omega}_{c}$. Nevertheless, for the Acrobot, ${T}_{{a}_{2}}^{+}=102$ and ${T}_{{a}_{2}}^{}=18$, so $$. Thus, the superiority of GANs in the generative framework is statistically significant only for the more complicated environment. This result was somewhat expected; the central promise of GANs is to work much better on complex and high dimensional domains. The MLP should not be much worse in an elementary environment like MountainCar.
Full table with the data that was used to calculate test statistics is available in Appendix A.
8 Discussion
Two research questions have been posed at the beginning of this thesis:

•
Can learning the imaginative model of the environment be more efficient than learning an optimal policy?

•
If so, can the learned imagination fulfil the promise of sample efficient modelbased RL in settings where dynamics of the real environment are unknown?
Both of the answers have been subsequently answered: Results presented in section 7 clearly show that simply exploiting Markov property allows imagination to converge with almost order of magnitude less data than it is required for learning an optimal policy. Additionally, the imagination was later successfully used to improve on data efficiency of the stateoftheart Rainbow DQN algorithm.
Initially, the thesis also hypothesised that the GAIRL framework presented throughout the thesis could produce a positive answer to both of the questions. Although in the end it did, one part of the hypothesis has been shown to be redundant in terms of simply answering these questions. Namely, generative adversarial architecture is not a necessity. More traditional models like multilayer perceptron can be successfully deployed within the imaginative framework as well. At least in simple environments used for the evaluation. Nevertheless, although not necessary, GANs tend to produce better results and should scale better to more complex settings.
Two main novel contributions of this thesis would be:

•
Efficient learning of the environment dynamics (creating imagination) by leveraging Markov property and advantages of generative adversarial networks.

•
Successful use of imagination to highly improve stateoftheart data efficiency of deep reinforcement learning through DynaQinspired algorithm.
Unfortunately, a month before the final submission deadline, Kaiser et al. (2019) introduced SimPLe – a novel deep RL algorithm that follows similar principles to the GAIRL framework. It was also inspired by the DynaQ algorithm and produced substantially more data efficient results than the Rainbow DQN. Therefore, it strips the novelty out of the second contribution of this thesis. However, because SimPLe leverages traditional L1/L2 loss model proposed by Oh et al. (2015) with only a few minor modifications, it often suffers from inaccurate generation (blurry images, disappearing small but crucial features). GAIRL’s approach of using GANs promises to circumvent this difficulty. However, more benchmarks in more complex environments should be performed to prove this hypothesis fully. This brings us to the point in the next paragraph.
GAIRL seems to overperform most recent stateoftheart and even introduce crucial contribution on top of the similar advancement that was produced by the top research institution in parallel. Nevertheless, the set of environments used for testing was limited due to strict time constraints. The framework needs to be evaluated on a higher variety of domains. Each of them should reflect one or more of the following properties: a very high dimensional state space, nondeterministic transition dynamics, or a partially observable Markov decision process (POMDP). GANs architecture promises to handle the first two smoothly, theoretically amplifying the gap between GAIRL and other approaches. However, GAIRL may perform worse in environments following the POMDP properties. Additionally, it should be empirically compared with the recently introduced SimPLe algorithm. Further experiments are the critical next step and are planned to continue after the submission of this thesis.
What is more, more detailed optimisation of GAIRL architecture has been left for future work. Not much focus has been given to hyperparameters. It would be interesting to see a thorough study of the best parameter choices for the GAIRL setting. Potentially, sample efficiency could have been even more significantly improved by decreasing the length of the modelfree phase and compensating it with the agent’s training in the imaginationbased phase. Additionally, reward and state generation should be combined within a single machine learning model. It is especially essential for the reward generation so it can utilise the benefits of GANs in stochastic environments.
Another promising direction is to modify GAIRL’s memory to follow a similar structure to the prioritised replay buffer (Schaul et al., 2015) that is used in the Rainbow DQN. It could lead to intriguing results. Moreover, current use of Wasserstein GAN could be utilised to guide the explorationexploitation tradeoff. As Azizzadenesheli et al. (2018) mentioned, high Wasserstein distance produced by the critic for certain states could indicate that the agent is unsure of the possible outcomes of such a state. Therefore, the agent should potentially move there to explore the search space better, even if it seems less optimal.
Finally, the length of the imagination training phase ${\rho}_{2}$ could become adaptive. Currently, it takes a fixed constant. However, imagination does not need the same amount of training time at every iteration. The first iteration should take the longest, as the imagination starts without any prior knowledge. However, during the second iteration, it already has a general overview of the world. It just needs to update its model slightly to capture newly gathered datapoints. Consequently, by the law of large numbers, it could follow that if $n$ indicates the number of iterations then ${lim}_{n\to inf}{\rho}_{2}=0$, potentially saving a substantial amount of computational power.
One of the main issues raised during the project demonstration session was that the presented comparison of GAIRL with Rainbow DQN is not fair towards the latter. The argument was that the GAIRL framework performs a considerable amount of additional computation in the background. The fact that GAIRL requires more computation is true. Although it has the same time complexity in terms of the big O notation, it performs slower and is less computationally efficient as shown in Appendix B. Naturally, using the real environment directly instead of its imperfect imagination allows the agent to find the optimal policy quicker, even without mentioning the computational power required to learn the imagination.
Nonetheless, data efficiency, not computational efficiency, is one of the most pressing problems in the field (Irpan, 2018), and the focus of this thesis. Computational power is cheap and widely accessible. Performing millions of random trial and error actions in the real environment, however, is often either very expensive or even impossible as explained throughout the thesis. Furthermore, this type of comparison is not new to the field. It has been used repeatedly in the literature (Mnih et al., 2015; Silver et al., 2016, 2017; Hessel et al., 2017; Schulman et al., 2017; Kaiser et al., 2019)
Another argument was that the imaginative module would not be able to grasp the dynamics of complex environments. Example of the real world in case of robotics was given. It is a reasonable concern. We cannot know for sure before performing appropriate experiments in such an environment as already mentioned earlier. However, the imagination module requires the same assumptions and the same setting as the current stateoftheart RL algorithms. In theory, if imagination cannot encapsulate the real environment, then the stateoftheart is not able to learn a closetooptimal policy either. Notwithstanding, this question is open for future experimental work.
The last point referred to employed benchmark environments. Namely, what is the point of learning the imaginative simulation for the environment that already is a simulation of a car/robot? Undeniably, it does not make sense in practice. However, the goal of this research was not to find the best solution to the Acrobot or MountainCar. It was about improving the generaluse stateoftheart reinforcement learning. Simulations were used merely as a mean to compare the capabilities of presented RL algorithms. Optimal or closetooptimal solutions to these environments, much better than any RL algorithm, has been devised decades ago. It did not stop, however, RL researchers to use them for benchmark purposes on conferences such as NeurIPS or ICML.
9 Conclusion
The goal of this thesis was to improve data efficiency of the stateoftheart reinforcement learning. This has been successfully achieved by introducing an imaginative framework that can accurately and efficiently approximate dynamics of the real environment by making use of Markov property and generative adversarial models.
It presented experimental evidence that supports the superiority of GANs over standard generative models for conditional state prediction within the reinforcement learning setting. It also introduced a way to utilise imperfect approximation of the real world to limit the amount of data needed to train an optimally behaving agent.
Similar advancement regarding sampleefficient reinforcement learning has been released in parallel (Kaiser et al., 2019). The study presented here, however, not only confirms the results obtained by Kaiser et al. (2019) but also adds an important contribution to the field that promises to improve the stateoftheart even further.
Nevertheless, more experiments are needed to ensure the superiority of the introduced framework, and there is still room open for future work.
References
 (1)

Abadi et al. (2015)
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp,
A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M.,
Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C.,
Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.,
Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P.,
Wattenberg, M., Wicke, M., Yu, Y. & Zheng, X. (2015), ‘TensorFlow: Largescale machine learning on
heterogeneous systems’.
Software available from tensorflow.org.
http://tensorflow.org/  Arjovsky et al. (2017) Arjovsky, M., Chintala, S. & Bottou, L. (2017), Wasserstein generative adversarial networks, in ‘International Conference on Machine Learning’, pp. 214–223.
 Arulkumaran et al. (2017) Arulkumaran, K., Deisenroth, M. P., Brundage, M. & Bharath, A. A. (2017), ‘A brief survey of deep reinforcement learning’, arXiv preprint arXiv:1708.05866 .
 Azizzadenesheli et al. (2018) Azizzadenesheli, K., Yang, B., Liu, W., Brunskill, E., Lipton, Z. & Anandkumar, A. (2018), ‘Surprising negative results for generative adversarial tree search’.
 BarthMaron et al. (2018) BarthMaron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N. & Lillicrap, T. (2018), ‘Distributed distributional deterministic policy gradients’, arXiv preprint arXiv:1804.08617 .
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W. & Munos, R. (2017), A distributional perspective on reinforcement learning, in ‘International Conference on Machine Learning’, pp. 449–458.
 Bellemare, Naddaf, Veness & Bowling (2013) Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. (2013), ‘The arcade learning environment: An evaluation platform for general agents’, Journal of Artificial Intelligence Research 47, 253–279.
 Bellemare, Veness & Bowling (2013) Bellemare, M., Veness, J. & Bowling, M. (2013), Bayesian learning of recursively factored environments, in ‘International Conference on Machine Learning’, pp. 1211–1219.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. (2016), ‘Openai gym’.
 Browne et al. (2012) Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S. & Colton, S. (2012), ‘A survey of monte carlo tree search methods’, IEEE Transactions on Computational Intelligence and AI in games 4(1), 1–43.
 Chiappa et al. (2017) Chiappa, S., Racaniere, S., Wierstra, D. & Mohamed, S. (2017), ‘Recurrent environment simulators’, arXiv preprint arXiv:1704.02254 .
 Coppersmith & Winograd (1990) Coppersmith, D. & Winograd, S. (1990), ‘Matrix multiplication via arithmetic progressions’, Journal of symbolic computation 9(3), 251–280.
 Doan et al. (2018) Doan, T., Mazoure, B. & Lyle, C. (2018), ‘Gan qlearning’, arXiv preprint arXiv:1805.04874 .
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A. & Bengio, Y. (2014), Generative adversarial nets, in ‘Advances in neural information processing systems’, pp. 2672–2680.
 Gu et al. (2016) Gu, S., Lillicrap, T., Sutskever, I. & Levine, S. (2016), Continuous deep qlearning with modelbased acceleration, in ‘International Conference on Machine Learning’, pp. 2829–2838.

Gulrajani et al. (2017)
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville,
A. C. (2017), Improved training of
wasserstein gans, in I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan & R. Garnett, eds, ‘Advances in Neural
Information Processing Systems 30’, Curran Associates, Inc., pp. 5767–5777.
http://papers.nips.cc/paper/7159improvedtrainingofwassersteingans.pdf  Hassabis et al. (2007) Hassabis, D., Kumaran, D., Vann, S. D. & Maguire, E. A. (2007), ‘Patients with hippocampal amnesia cannot imagine new experiences’, Proceedings of the National Academy of Sciences 104(5), 1726–1731.
 Henderson et al. (2018) Henderson, P., Chang, W.D., Bacon, P.L., Meger, D., Pineau, J. & Precup, D. (2018), Optiongan: Learning joint rewardpolicy options using generative adversarial inverse reinforcement learning.
 Hessel et al. (2017) Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. & Silver, D. (2017), ‘Rainbow: Combining improvements in deep reinforcement learning’, arXiv preprint arXiv:1710.02298 .
 Ho & Ermon (2016) Ho, J. & Ermon, S. (2016), Generative adversarial imitation learning, in ‘Advances in Neural Information Processing Systems’, pp. 4565–4573.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. & Schmidhuber, J. (1997), ‘Long shortterm memory’, Neural computation 9(8), 1735–1780.
 Irpan (2018) Irpan, A. (2018), ‘Deep reinforcement learning doesn’t work yet’, https://www.alexirpan.com/2018/02/14/rlhard.html.
 Isola et al. (2017) Isola, P., Zhu, J.Y., Zhou, T. & Efros, A. A. (2017), Imagetoimage translation with conditional adversarial networks, in ‘2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)’, IEEE, pp. 5967–5976.
 Kaiser et al. (2019) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S. et al. (2019), ‘Modelbased reinforcement learning for atari’, arXiv preprint arXiv:1903.00374 .
 Kingma & Ba (2014) Kingma, D. P. & Ba, J. (2014), ‘Adam: A method for stochastic optimization’, arXiv preprint arXiv:1412.6980 .
 Kingma & Welling (2013) Kingma, D. P. & Welling, M. (2013), ‘Autoencoding variational bayes’, arXiv preprint arXiv:1312.6114 .
 LeCun et al. (1999) LeCun, Y., Haffner, P., Bottou, L. & Bengio, Y. (1999), Object recognition with gradientbased learning, in ‘Shape, contour and grouping in computer vision’, Springer, pp. 319–345.
 Leibfried et al. (2016) Leibfried, F., Kushman, N. & Hofmann, K. (2016), ‘A deep learning approach for joint video frame and reward prediction in atari games’, arXiv preprint arXiv:1611.07078 .
 Leinweber et al. (2017) Leinweber, M., Ward, D. R., Sobczak, J. M., Attinger, A. & Keller, G. B. (2017), ‘A sensorimotor circuit in mouse cortex for visual flow predictions’, Neuron 95(6), 1420–1432.
 Lenz et al. (2015) Lenz, I., Knepper, R. A. & Saxena, A. (2015), Deepmpc: Learning deep latent features for model predictive control., in ‘Robotics: Science and Systems’.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. & Wierstra, D. (2015), ‘Continuous control with deep reinforcement learning’, arXiv preprint arXiv:1509.02971 .
 Ling & Li (1998) Ling, C. X. & Li, C. (1998), Data mining for direct marketing: Problems and solutions., in ‘Kdd’, Vol. 98, pp. 73–79.
 Lucic et al. (2018) Lucic, M., Kurach, K., Michalski, M., Gelly, S. & Bousquet, O. (2018), Are gans created equal? a largescale study, in ‘Advances in neural information processing systems’, pp. 698–707.
 Maas et al. (2013) Maas, A. L., Hannun, A. Y. & Ng, A. Y. (2013), Rectifier nonlinearities improve neural network acoustic models, in ‘Proc. icml’, Vol. 30, p. 3.
 Medsker & Jain (1999) Medsker, L. & Jain, L. C. (1999), Recurrent neural networks: design and applications, CRC press.
 Mirza & Osindero (2014) Mirza, M. & Osindero, S. (2014), ‘Conditional generative adversarial nets’, arXiv preprint arXiv:1411.1784 .
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015), ‘Humanlevel control through deep reinforcement learning’, Nature 518(7540), 529.
 Moore (1990) Moore, A. W. (1990), ‘Efficient memorybased learning for robot control’.
 Nair & Hinton (2010) Nair, V. & Hinton, G. E. (2010), Rectified linear units improve restricted boltzmann machines, in ‘Proceedings of the 27th international conference on machine learning (ICML10)’, pp. 807–814.
 Ng et al. (2000) Ng, A. Y., Russell, S. J. et al. (2000), Algorithms for inverse reinforcement learning., in ‘Icml’, pp. 663–670.
 Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L. & Singh, S. (2015), Actionconditional video prediction using deep networks in atari games, in ‘Advances in neural information processing systems’, pp. 2863–2871.
 Peters et al. (2005) Peters, J., Vijayakumar, S. & Schaal, S. (2005), Natural actorcritic, in ‘European Conference on Machine Learning’, Springer, pp. 280–291.
 Pfau & Vinyals (2016) Pfau, D. & Vinyals, O. (2016), ‘Connecting generative adversarial networks and actorcritic methods’, arXiv preprint arXiv:1610.01945 .
 Pfeiffer & Foster (2013) Pfeiffer, B. E. & Foster, D. J. (2013), ‘Hippocampal placecell sequences depict future paths to remembered goals’, Nature 497(7447), 74.
 Racanière et al. (2017) Racanière, S., Weber, T., Reichert, D., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y. et al. (2017), Imaginationaugmented agents for deep reinforcement learning, in ‘Advances in Neural Information Processing Systems’, pp. 5690–5701.
 Schaal et al. (2003) Schaal, S., Ijspeert, A. & Billard, A. (2003), ‘Computational approaches to motor learning by imitation’, Philosophical Transactions of the Royal Society of London B: Biological Sciences 358(1431), 537–547.
 Schacter et al. (2012) Schacter, D. L., Addis, D. R., Hassabis, D., Martin, V. C., Spreng, R. N. & Szpunar, K. K. (2012), ‘The future of memory: remembering, imagining, and the brain’, Neuron 76(4), 677–694.
 Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I. & Silver, D. (2015), ‘Prioritized experience replay’, arXiv preprint arXiv:1511.05952 .
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017), ‘Proximal policy optimization algorithms’, arXiv preprint arXiv:1707.06347 .
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016), ‘Mastering the game of go with deep neural networks and tree search’, nature 529(7587), 484.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. & Riedmiller, M. (2014), Deterministic policy gradient algorithms, in ‘ICML’.
 Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017), ‘Mastering the game of go without human knowledge’, Nature 550(7676), 354.
 Sutton (1990) Sutton, R. S. (1990), Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, in ‘Machine Learning Proceedings 1990’, Elsevier, pp. 216–224.
 Sutton (1996) Sutton, R. S. (1996), Generalization in reinforcement learning: Successful examples using sparse coarse coding, in ‘Advances in neural information processing systems’, pp. 1038–1044.
 Sutton & Barto (1998) Sutton, R. S. & Barto, A. G. (1998), Introduction to reinforcement learning, Vol. 135, MIT press Cambridge.
 Talvitie (2014) Talvitie, E. (2014), Model regularization for stable sample rollouts., in ‘UAI’, pp. 780–789.
 Talvitie (2015) Talvitie, E. (2015), Agnostic system identification for monte carlo planning, in ‘TwentyNinth AAAI Conference on Artificial Intelligence’.
 Tesauro (1995) Tesauro, G. (1995), Tdgammon: A selfteaching backgammon program, in ‘Applications of Neural Networks’, Springer, pp. 267–285.
 Tolman (1948) Tolman, E. C. (1948), ‘Cognitive maps in rats and men.’, Psychological review 55(4), 189.
 Van Hasselt (2010) Van Hasselt, H. (2010), Double qlearning, in ‘Advances in Neural Information Processing Systems’, pp. 2613–2621.
 Van Hasselt et al. (2016) Van Hasselt, H., Guez, A. & Silver, D. (2016), Deep reinforcement learning with double qlearning., in ‘AAAI’, Vol. 2, Phoenix, AZ, p. 5.
 Venkatraman et al. (2016) Venkatraman, A., Capobianco, R., Pinto, L., Hebert, M., Nardi, D. & Bagnell, J. A. (2016), Improved learning of dynamics models for control, in ‘International Symposium on Experimental Robotics’, Springer, pp. 703–713.
 Villani (2008) Villani, C. (2008), Optimal transport: old and new, Vol. 338, Springer Science & Business Media.
 Wang et al. (2015) Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M. & De Freitas, N. (2015), ‘Dueling network architectures for deep reinforcement learning’, arXiv preprint arXiv:1511.06581 .
 Watkins & Dayan (1992) Watkins, C. J. & Dayan, P. (1992), ‘Qlearning’, Machine learning 8(34), 279–292.
 Williams (1992) Williams, R. J. (1992), ‘Simple statistical gradientfollowing algorithms for connectionist reinforcement learning’, Machine learning 8(34), 229–256.
 Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N. & Recht, B. (2017), The marginal value of adaptive gradient methods in machine learning, in ‘Advances in Neural Information Processing Systems’, pp. 4148–4158.
 Xiao & Kesineni (2016) Xiao, T. & Kesineni, G. (2016), ‘Generative adversarial networks for model based reinforcement learning with tree search’.
 Zeiler et al. (2010) Zeiler, M. D., Krishnan, D., Taylor, G. W. & Fergus, R. (2010), ‘Deconvolutional networks’.
 Zhao et al. (2017) Zhao, H., Gallo, O., Frosio, I. & Kautz, J. (2017), ‘Loss functions for image restoration with neural networks’, IEEE Transactions on Computational Imaging 3(1), 47–57.
Appendix A Numerical results for data efficiency
Environment  Random seed  Imaginationfree  MLP imagination  WGANGP imagination 

MountainCar  10  490.1  856.1  292.6 
MountainCar  50  1314.4  625.2  255.7 
MountainCar  100  821.5  397.3  300.3 
MountainCar  500  514.6  297.2  325 
MountainCar  1000  970.2  415.3  395.8 
MountainCar  5000  510.8  245.6  134 
MountainCar  10000  950.5  222.3  317.8 
MountainCar  50000  516.3  372.6  336.1 
MountainCar  100000  1156  390.9  556.5 
MountainCar  500000  881.3  531.5  236 
MountainCar  1000000  726.1  298.3  217.1 
MountainCar  5000000  687.9  207.7  210.6 
MountainCar  10000000  791.7  456.1  411 
MountainCar  50000000  1095  369  335.5 
MountainCar  100000000  935.2  377.2  533.3 
Acrobot  10  641.2  205.1  71.2 
Acrobot  50  814.5  74.3  94.4 
Acrobot  100  328.4  118.1  71.4 
Acrobot  500  153.6  152.7  163.3 
Acrobot  1000  548.8  75.8  150 
Acrobot  5000  207.5  96.3  119.9 
Acrobot  10000  254.7  87  65.1 
Acrobot  50000  601  317.6  132.1 
Acrobot  100000  176.2  359.7  100 
Acrobot  500000  653.2  230.2  151.6 
Acrobot  1000000  285.9  291.1  124.3 
Acrobot  5000000  399.7  117.9  101.3 
Acrobot  10000000  397.2  199.3  74.1 
Acrobot  50000000  318.1  172  86.8 
Acrobot  100000000  324.1  139.7  68.8 
Appendix B Computational efficiency analysis
Table 5 presents a comparison of the algorithms in regards to the time they required until convergence on a 4CPU instance. For both Acrobot and MountainCar, imaginative framework employing MLP for the imagination module requires approximately 7 times more time to converge to the optimal policy than the imaginationfree agent. Full GAIRL algorithms run approximately 2.5 times longer than the MLPbased agent and 17 times longer than the imaginationfree agent.
It clearly shows that the stateoftheart is much more computationally efficient than any of the proposed algorithms. Combining it with the results from section 7, we can observe that there is an inverse proportional relation between data efficiency and computational efficiency.
However, GAIRL performance should scale better with higher computational power. The most demanding part of GAIRL is the imagination training phase. The ITP employs a standard supervised learning process that can easily leverage high parallelism provided by GPUs and TPUs. Highly sequential reinforcement learning, on the other hand, is much harder to parallelise. Naturally, GAIRL can never reach the same efficiency as imaginationfree options; however, it may get asymptotically closer when more powerful hardware is provided.
Environment  Random seed  WGANGP imagination  MLP imagination  Imaginationfree 

MountainCar  10  464.534  298.221  49.078 
MountainCar  50  822.961  304.695  55.173 
MountainCar  100  651.695  212.879  29.919 
MountainCar  500  1047.146  414.624  31.019 
MountainCar  1000  334.610  232.937  46.608 
MountainCar  5000  652.493  340.321  33.857 
MountainCar  10000  333.088  246.786  35.686 
MountainCar  50000  285.082  133.015  42.096 
MountainCar  100000  906.235  375.812  32.433 
MountainCar  500000  761.129  111.714  59.211 
MountainCar  1000000  615.542  286.956  31.195 
MountainCar  5000000  741.330  381.619  42.303 
MountainCar  10000000  936.039  152.674  41.251 
MountainCar  50000000  695.203  166.715  36.890 
MountainCar  100000000  603.739  225.416  35.841 
Mean  –  656.722  258.959  40.171 
Acrobot  10  119.521  138.605  13.962 
Acrobot  50  104.571  84.227  41.711 
Acrobot  100  312.638  216.284  20.661 
Acrobot  500  536.062  192.328  8.235 
Acrobot  1000  334.610  127.010  26.104 
Acrobot  5000  485.868  96.694  10.359 
Acrobot  10000  333.088  209.477  22.162 
Acrobot  50000  152.203  70.388  10.528 
Acrobot  100000  259.356  185.065  30.179 
Acrobot  500000  605.006  28.556  19.241 
Acrobot  1000000  262.557  125.856  12.956 
Acrobot  5000000  116.214  171.756  11.067 
Acrobot  10000000  335.291  143.612  14.851 
Acrobot  50000000  224.329  91.233  26.151 
Acrobot  100000000  313.412  185.561  19.451 
Mean  –  335.155  137.777  19.175 
Appendix C Generative hyperparameters for MNIST
Hyperparameter  Value 

Hidden layers  [256, 512, 1024] 
Hidden layers activation  Leaky ReLU 
Leakiness parameter (${\alpha}_{lrelu}$)  $0.2$ 
Dropout probability  $0$ 
Final layer activation  Tanh 
Optimiser  Adam 
Learning rate (${\alpha}_{lr}$)  $2\times {10}^{4}$ 
First Adam decay rate (${\beta}_{1}$)  $0.9$ 
Second Adam decay rate (${\beta}_{2}$)  $0.999$ 
Hyperparameter  Value 

Generator layers  [256, 512, 1024] 
Generator final layer activation  Tanh 
Critic layers  [1024, 1024, 1024] 
Critic final layer activation  Linear 
Critic steps for one generator step  $10$ 
Hidden layers activation  Leaky ReLU 
Leakiness parameter (${\alpha}_{lrelu}$)  $0.2$ 
Dropout probability  $0$ 
Noise size  $100$ 
Penalty coefficient  $10$ 
Optimiser  Adam 
Learning rate (${\alpha}_{lr}$)  $2\times {10}^{4}$ 
First Adam decay rate (${\beta}_{1}$)  $0.5$ 
Second Adam decay rate (${\beta}_{2}$)  $0.9$ 
Hyperparameter  Value 
Generator layers  [256, 512, 1024] 
Generator dropout probability  $0$ 
Generator final layer activation  Tanh 
Discriminator layers  [1024, 512, 256] 
Discriminator dropout probability  $0.2$ 
Discriminator final layer activation  Sigmoid 
Discriminator steps for one generator step  $1$ 
Hidden layers activation  Leaky ReLU 
Leakiness parameter (${\alpha}_{lrelu}$)  $0.2$ 
Noise size  100 
Optimiser  Adam 
Learning rate (${\alpha}_{lr}$)  $2\times {10}^{4}$ 
First Adam decay rate (${\beta}_{1}$)  $0.9$ 
Second Adam decay rate (${\beta}_{2}$)  $0.999$ 