Jointly Pretraining with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning
Abstract
Deep Reinforcement Learning (DRL) algorithms are known to be data-inefficient. One reason is that a DRL agent learns both the features and the policy tabula rasa. Integrating prior knowledge into DRL algorithms is one way to improve learning efficiency, since it helps to build useful representations. In this work, we consider incorporating human knowledge to accelerate the asynchronous advantage actor-critic (A3C) algorithm by pretraining on a small amount of non-expert human demonstrations. We leverage the supervised autoencoder framework and propose a novel pretraining strategy that jointly trains a weighted supervised classification loss, an unsupervised reconstruction loss, and an expected return loss. The resulting pretrained model learns more useful features than a model trained independently in a supervised or unsupervised fashion. Our pretraining method drastically improved the learning performance of the A3C agent in the Atari games Pong and MsPacman, exceeding the performance of state-of-the-art algorithms with a much smaller number of game interactions. Our method is lightweight and easy to implement on a single machine. For reproducibility, our code is available at github.com/gabrieledcjr/DeepRL/tree/A3CALA2019
A Preprint
Gabriel V. de la Cruz Jr., Yunshu Du and Matthew E. Taylor
School of Electrical Engineering and Computer Science
Washington State University
Pullman, WA 99164-2752
{gabriel.delacruz,yunshu.du,matthew.e.taylor}@wsu.edu
June 19, 2019
Keywords Reinforcement Learning; Deep learning; Learning from Humans
1 Introduction
Deep Reinforcement Learning (DRL) has become an increasingly popular general machine learning technique, significantly contributing to the resurgence in neural networks research. Not only can DRL allow machine learning algorithms to learn appropriate representations without extensive handcrafting of input, it can also achieve record-setting performance across multiple types of problems Mnih et al. (2015); Mnih et al. (2016); Silver et al. (2016); Silver et al. (2018). However, one of the main drawbacks of DRL is its data complexity. Similar to classic RL algorithms, DRL suffers from slow initial learning as it learns tabula rasa. While acceptable in simulated environments, the long learning time of DRL has made it impractical for real-world problems where bad initial performance is unaffordable, such as in robotics, self-driving cars, and health care applications Bojarski et al. (2017); Li (2017); Miotto et al. (2017).
There are two components of learning in DRL: feature learning and policy learning. While DRL is able to directly extract features using a deep neural network as its nonlinear function approximator, this process adds training time on top of policy learning and consequently slows down DRL algorithms. In this work, we propose several pretraining techniques to tackle the feature learning problem in DRL. We believe that by aiding one of the learning components, a DRL agent will be able to focus more on policy learning and thus improve the overall learning speed.
Many techniques have been proposed to address the data inefficiency of DRL. Transfer learning has been shown to work well for RL problems Taylor and Stone (2009). The intuition is that knowledge acquired from previously learned source tasks can be transferred to related target tasks, so that the target tasks are learned faster because the agent does not start from scratch. Learning from demonstrations (LfD) Argall et al. (2009); Hester et al. (2018); Pohlen et al. (2018); Gao et al. (2018) is also an effective way to accelerate learning. In particular, demonstration data of a task can be collected from either a human demonstrator or a pretrained agent; a new agent can start by mimicking the demonstrator’s behavior to obtain a reasonable initial policy quickly, and later move away from the demonstrator and learn on its own. One can also leverage additional auxiliary losses to gather extra information about a task Jaderberg et al. (2017); Mirowski et al. (2016); Schmitt et al. (2018); Du et al. (2018). For example, an agent can jointly optimize the policy loss and an unsupervised reconstruction loss; doing so explicitly encourages learning the features. In this work, we combine the flavor of the methods above and propose a pretraining strategy to speed up learning. Our method jointly pretrains a supervised classification loss, an unsupervised reconstruction loss, and a value function loss.
2 Preliminaries
2.1 Deep Reinforcement Learning
We consider a reinforcement learning (RL) problem that is modeled using a Markov Decision Process (MDP), represented by a 5-tuple $\langle S,A,P,R,\gamma \rangle$. A state ${S}_{t}$ represents the environment at time $t$. An agent learns what action ${A}_{t}\in \mathcal{A}(s)$ to take in ${S}_{t}$ by interacting with the environment. A reward ${R}_{t+1}\in \mathcal{R}\subset \mathbb{R}$ is given based on the action executed and the next state reached, ${S}_{t+1}$. The goal is to maximize the expected cumulative return ${G}_{t}={\sum}_{k=0}^{\infty}{\gamma}^{k}{R}_{t+k+1}$, where $\gamma \in [0,1]$ is a discount factor that determines the relative importance of future and immediate rewards Sutton and Barto (2018).
In value-based RL algorithms, an agent learns the state-action value function ${Q}^{\pi}(s,a)={\mathbb{E}}_{{s}^{\prime}}[r+\gamma {\max}_{{a}^{\prime}}{Q}^{\pi}({s}^{\prime},{a}^{\prime})\mid s,a]$, and the optimal value function ${Q}^{*}(s,a)={\max}_{\pi}{Q}^{\pi}(s,a)$ gives the expected return for taking an action $a$ at state $s$ and thereafter following an optimal policy. However, directly computing Q values is not feasible when the state space is large. The deep Q-network (DQN) algorithm Mnih et al. (2015) uses a deep neural network (parameterized as $\theta $) to approximate the Q function as $Q(s,a;\theta )\approx {Q}^{*}(s,a)$. At each iteration $i$, DQN minimizes the loss
$${L}_{i}({\theta}_{i})={\mathbb{E}}_{s,a,r,{s}^{\prime}}\left[{(y-Q(s,a;{\theta}_{i}))}^{2}\right]$$
where $y=r+\gamma {\max}_{{a}^{\prime}}Q({s}^{\prime},{a}^{\prime};{\theta}_{i}^{-})$ is the target, computed with a target network (parameterized as ${\theta}_{i}^{-}$) that was generated from previous iterations. The key component that helps to stabilize learning is the experience replay memory Lin (1992), which stores past experiences. An update is performed by drawing a batch of 32 experiences (a minibatch) uniformly at random from the replay memory; doing so makes the sampled data approximately i.i.d. and thus stabilizes learning. Reward clipping also helps to make DQN work: all rewards are clipped to $[-1,1]$, which avoids the potential instability caused by the different reward scales of different environments.
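To make the DQN update above concrete, the following is a minimal sketch of the clipped reward, the target $y$, and the squared loss averaged over a minibatch. The names `clip_reward`, `td_target`, and `dqn_loss` are hypothetical, the two networks are stood in for by plain callables that return a list of Q values, and the `done` flag (no bootstrapping at terminal states) is a common convention that the text above does not state.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a raw reward to [low, high], as in DQN reward clipping."""
    return max(low, min(high, reward))


def td_target(reward, next_q_values, gamma, done=False):
    """y = r + gamma * max_a' Q(s', a'; theta^-).

    Assumption: at a terminal transition the target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(minibatch, q_online, q_target, gamma):
    """Mean squared error between targets and online Q values.

    minibatch: iterable of (state, action, reward, next_state, done).
    q_online / q_target: hypothetical callables mapping a state to a
    list of Q values, one per action.
    """
    total = 0.0
    for state, action, reward, next_state, done in minibatch:
        y = td_target(clip_reward(reward), q_target(next_state), gamma, done)
        total += (y - q_online(state)[action]) ** 2
    return total / len(minibatch)
```

Keeping the target network behind a separate callable mirrors the role of ${\theta}_{i}^{-}$: it is evaluated but never updated inside the loss.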
2.2 Asynchronous Advantage Actor-Critic
The asynchronous advantage actor-critic (A3C) algorithm Mnih et al. (2016) is a policy-based method that combines the actor-critic framework with a deep neural network. A3C learns both a policy function $\pi ({a}_{t}|{s}_{t};\theta )$ (parameterized as $\theta $) and a value function $V({s}_{t};{\theta}_{v})$ (parameterized as ${\theta}_{v}$). The policy function is the actor that decides which action to take, while the value function is the critic that evaluates the quality of the action and also bootstraps learning. The policy loss given by Mnih et al. (2016) is
$$\begin{array}{rl}{L}_{policy}^{a3c}=& {\nabla}_{\theta}\log (\pi ({a}_{t}|{s}_{t};\theta ))\left({Q}^{(n)}({s}_{t},{a}_{t};\theta ,{\theta}_{v})-V({s}_{t};{\theta}_{v})\right)\\ & -{\beta}^{a3c}\mathscr{H}{\nabla}_{\theta}\left(\pi ({s}_{t};\theta )\right)\end{array}$$
where ${Q}^{(n)}({s}_{t},{a}_{t};\theta ,{\theta}_{v})={\sum}_{k=0}^{n-1}{\gamma}^{k}{r}_{t+k}+{\gamma}^{n}V({s}_{t+n};{\theta}_{v})$ is the $n$-step bootstrapped value that is bounded by a hyperparameter ${t}_{max}$ ($n\le {t}_{max}$). $\mathscr{H}$ is an entropy regularizer for policy $\pi $ (weighted by ${\beta}^{a3c}$), which helps to prevent premature convergence to suboptimal policies. The value loss is
$${L}_{value}^{a3c}={\nabla}_{{\theta}_{v}}\left({\left({Q}^{(n)}({s}_{t},{a}_{t};\theta ,{\theta}_{v})-V({s}_{t};{\theta}_{v})\right)}^{2}\right)$$
The A3C loss is then
$${L}^{a3c}={L}_{policy}^{a3c}+\alpha {L}_{value}^{a3c}$$  (1) 
where $\alpha $ is a weight for the value loss. A3C runs $k$ actor-learners in parallel, each with its own copy of the environment and parameters. An update is performed using data collected from all actors. In this work, we use the feed-forward version of A3C Mnih et al. (2016) for all experiments. The architecture consists of three convolutional layers and one fully connected layer (fc1), followed by two fully connected output branches: a policy function output layer (fc2) and a value function output layer (fc3).
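As an illustration of the $n$-step bootstrapped value ${Q}^{(n)}$, the sketch below computes the target for every step of a short rollout by working backwards from the critic's estimate of the state after the last step. The function names and the list-based interface are assumptions made for this example; they are not taken from the released code.

```python
def n_step_targets(rewards, bootstrap_value, gamma):
    """Return Q^(n) for every step of a rollout segment.

    rewards: [r_t, ..., r_{t+n-1}] collected by one actor-learner.
    bootstrap_value: V(s_{t+n}; theta_v), or 0.0 if s_{t+n} is terminal.
    The step at index i gets an (n - i)-step target, so the first step
    uses the full n-step return and the last step a 1-step return.
    """
    targets = []
    q = bootstrap_value
    for r in reversed(rewards):
        q = r + gamma * q
        targets.append(q)
    targets.reverse()
    return targets


def advantages(targets, values):
    """Q^(n)(s, a) - V(s), the factor that weights log pi(a|s) in the policy loss."""
    return [q - v for q, v in zip(targets, values)]
```

The backward recursion avoids recomputing the powers of $\gamma $ for each step, and the segment length plays the role of $n\le {t}_{max}$.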
2.3 Transformed Bellman Operator for A3C
While the reward clipping technique helped to reduce the variance and stabilize learning in DQN, Hester et al. (2018) found that clipping all rewards to $[-1,1]$ hurts the performance in games where the rewards have various scales. For example, in the game of MsPacman, a single dot is worth $10$ points, while a cherry bonus is worth $100$ points; when both are clipped to $1$, the agent becomes incapable of distinguishing between small and large rewards, resulting in reduced performance. Pohlen et al. (2018) proposed the transformed Bellman operator to overcome this problem in DQN. Instead of changing the magnitude of rewards, Pohlen et al. (2018) consider reducing the scale of the action-value function, which enables DQN to use raw rewards instead of clipped ones. In particular, a transform function
$$h:z\mapsto \mathrm{sign}(z)\left(\sqrt{|z|+1}-1\right)+\epsilon z$$  (2)
is applied to reduce the scale of ${Q}^{(n)}({s}_{t},{a}_{t};\theta ,{\theta}_{v})$ and $Q$ is transformed as
$${Q}_{TB}^{(n)}({s}_{t},{a}_{t};\theta ,{\theta}_{v})=\sum _{k=0}^{n-1}h\left({\gamma}^{k}{r}_{t+k}+{\gamma}^{n}{h}^{-1}\left(V({s}_{t+n};{\theta}_{v})\right)\right)$$  (3)
In this work, we apply the transformed Bellman operator in the A3C algorithm and use the raw reward value (instead of clipped) to perform updates. We denote this method as A3CTB.
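The transform $h$ of Equation (2) and its inverse can be sketched in a few lines. The value $\epsilon ={10}^{-2}$ below is an assumption for illustration only (the text does not state it), and `transformed_target` shows a one-step use of the pattern $h(r+\gamma {h}^{-1}(V))$ rather than the full $n$-step expression of Equation (3). The inverse is obtained by solving the quadratic in $\sqrt{|z|+1}$ that Equation (2) implies.

```python
import math

EPSILON = 1e-2  # assumed value for illustration; not given in the text


def h(z, eps=EPSILON):
    """Equation (2): sign(z) * (sqrt(|z| + 1) - 1) + eps * z."""
    return math.copysign(1.0, z) * (math.sqrt(abs(z) + 1.0) - 1.0) + eps * z


def h_inv(x, eps=EPSILON):
    """Closed-form inverse of h (requires eps > 0).

    With y = sqrt(|z| + 1), |h(z)| = eps * y**2 + y - (1 + eps), which is
    solved for y and mapped back to z = sign(x) * (y**2 - 1).
    """
    y = (math.sqrt(1.0 + 4.0 * eps * (abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(1.0, x) * (y * y - 1.0)


def transformed_target(reward, next_value, gamma, eps=EPSILON):
    """One-step target in the transformed space: h(r + gamma * h_inv(V(s')))."""
    return h(reward + gamma * h_inv(next_value, eps), eps)
```

Under this assumed $\epsilon $, raw rewards of 10 and 100 map to roughly 2.4 and 10.0, so the larger reward stays distinguishable while the scale of the value targets is compressed.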
2.4 Self-Imitation Learning
The self-imitation learning (SIL) algorithm aims to encourage the agent to learn from its own past good experiences Oh et al. (2018). Built on the actor-critic framework Mnih et al. (2016), SIL adds a replay buffer $\mathcal{D}=\{({s}_{t},{a}_{t},{G}_{t})\}$ to store the agent’s past experiences. The authors propose the following off-policy actor-critic loss
$${L}_{policy}^{sil}=-\log (\pi ({a}_{t}|{s}_{t};\theta )){\left({G}_{t}-V({s}_{t};{\theta}_{v})\right)}_{+}$$
$${L}_{value}^{sil}=\frac{1}{2}{\left({\left({G}_{t}-V({s}_{t};{\theta}_{v})\right)}_{+}\right)}^{2}$$
where ${(\cdot )}_{+}=\max (\cdot ,0)$ and ${G}_{t}={\sum}_{k=0}^{\infty}{\gamma}^{k}{R}_{t+k+1}$ is the discounted sum of rewards. The SIL loss is then
$${L}^{sil}={L}_{policy}^{sil}+{\beta}^{sil}{L}_{value}^{sil}$$  (4) 
In this work, we leverage this framework and incorporate SIL in A3CTB (see Section 3.1).
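For a single stored transition, the SIL objective of Equation (4) reduces to a few arithmetic operations. The sketch below is illustrative only: `sil_losses` is a hypothetical helper, and the weight `beta` is left as a required argument because its value is not given in this section.

```python
def sil_losses(log_prob, ret, value, beta):
    """Per-sample SIL losses.

    log_prob: log pi(a_t | s_t; theta) of the stored action.
    ret:      stored return G_t.
    value:    current estimate V(s_t; theta_v).
    Returns (policy_loss, value_loss, total_loss).
    """
    clipped = max(ret - value, 0.0)      # (G_t - V)_+
    policy_loss = -log_prob * clipped    # imitate only when G_t > V
    value_loss = 0.5 * clipped ** 2
    return policy_loss, value_loss, policy_loss + beta * value_loss
```

When the stored return does not exceed the current value estimate, all three terms are zero, so only experiences that turned out better than expected are imitated.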
2.5 Supervised Pretraining
In our previous work, supervised pretraining consists of a two-stage learning hierarchy de la Cruz et al. (2018): 1) pretraining on human demonstration data, and 2) initializing a DRL agent’s network with the pretrained network $\theta \leftarrow {\theta}_{s}$. It uses non-expert human demonstrations as its training data, where the game states are the neural network’s inputs and the (possibly non-optimal) human actions are assumed to be the true labels for each game state. The network is pretrained with the cross-entropy loss
$${L}_{s}=-\sum _{\forall x}p(x)\log \left(q(x;{\theta}_{s})\right)$$
where $x$ is the image game state $s$ and $p(x)$ is the distribution over the discrete variable $x$, represented here as a one-hot vector of the human action $a$, while $q(x;{\theta}_{s})$ is the output distribution of the supervised learning neural network with the weights ${\theta}_{s}$.
2.6 Supervised Autoencoder
An autoencoder learns to reconstruct its inputs and has been used for unsupervised learning of features. A supervised autoencoder (SAE) is an autoencoder with an additional supervised loss that can better extract representations that are tailored to the class labels. For example in Le et al. (2018), the authors consider a supervised learning setting where the goal is to learn a mapping between some inputs $X\in \mathbb{R}$ and some targets $Y\in \mathbb{R}$. Instead of learning with only a supervised loss, an auxiliary reconstruction loss is integrated and the following SAE objective is proposed:
$${L}^{sae}={L}_{s}^{sae}({W}_{s}F{x}_{i},{y}_{i})+{L}_{ae}^{sae}({W}_{ae}F{x}_{i},{x}_{i})$$  (5) 
where $F$ is the weights of a neural network; ${W}_{s}$ and ${W}_{ae}$ are the weights for the supervised output layer and the autoencoder output layer respectively. Here, ${L}_{s}^{sae}$ and ${L}_{ae}^{sae}$ can be any loss functions (e.g., MSE).
Existing work has considered training with an SAE loss from scratch. Our method is different in that i) we consider the SAE loss as a pretraining method instead of training from scratch and ii) we jointly pretrain a supervised loss and a reconstruction loss and then use the learned parameters as initialization (i.e., the two-stage hierarchy described in Section 2.5). In this work, we explore whether incorporating an unsupervised loss in supervised pretraining can further boost the learning of an agent.
3 Methodologies
This section describes our proposed algorithms. First, we show that by incorporating the SIL framework (see Section 2.4), we can further improve the performance of the original A3C algorithm Mnih et al. (2016) and the A3CTB variant from our previous work de la Cruz et al. (2018); we term this new method A3CTB+SIL. Then, we introduce our proposed pretraining methods and show that, after pretraining, A3CTB+SIL can achieve superior results; its performance on MsPacman exceeds or is comparable to that of some state-of-the-art algorithms that use human demonstrations (e.g., Hester et al. (2018); Oh et al. (2018)), while having much lower computational demands.
3.1 A3C with Self-Imitation Learning
We incorporate the self-imitation learning (SIL) framework (see Section 2.4) in A3CTB with the following modifications. To enable using raw rewards (as was done in A3CTB), we apply the transformation function $h$ (Equation (2)) to the returns as ${G}_{t}=h({r}_{t+1}+\gamma {h}^{-1}({G}_{t+1}))$. We also add a SIL-learner in parallel with the $k$ actor-learners in A3C (i.e., there are a total of $k+1$ parallel threads). The SIL-learner does not have its own copy of the environment; it learns by optimizing ${L}^{sil}$ using minibatch sampling from $\mathcal{D}$. The SIL-learner acts similarly to the other actor-learners in that it updates the global network asynchronously. Each actor-learner contributes to $\mathcal{D}$ through a shared episode queue ${\mathcal{Q}}_{\mathcal{E}}$, where $\mathcal{E}$ is an episode buffer for each actor-learner that stores the observation at time $t$ as $\{{s}_{t},{a}_{t},{r}_{t}\}$ until a terminal state is reached (i.e., the end of an episode). At a terminal state, the actor-learner computes the returns, ${G}_{t}$, with the transformation, $h$, for each step in the episode. Then $\mathcal{E}$, together with the computed transformed returns, is added to the shared episode queue ${\mathcal{Q}}_{\mathcal{E}}$. The pseudocode for the SIL-learner is shown in Algorithm 3.1. We denote this method as A3CTB+SIL.
Algorithm 3.1 SIL-learner
// Assume global shared parameter vectors $\theta $ and ${\theta}_{v}$ and global shared counter $T=0$
// Assume global shared episodes queue ${\mathcal{Q}}_{\mathcal{E}}$
// Assume thread-specific parameter vectors ${\theta}^{\prime}$ and ${\theta}_{v}^{\prime}$
Initialize replay memory $\mathcal{D}=\varnothing $
repeat
    Synchronize parameters ${\theta}^{\prime}\leftarrow \theta $ and ${\theta}_{v}^{\prime}\leftarrow {\theta}_{v}$
    for $m\leftarrow 1$ to $M$ do
        Sample a minibatch $\{{s}_{j},{a}_{j},{G}_{j}\}$ from $\mathcal{D}$
        Compute gradients w.r.t. ${\theta}^{\prime}$: $d\theta \leftarrow {\nabla}_{{\theta}^{\prime}}\log \pi ({a}_{j}|{s}_{j};{\theta}^{\prime}){({G}_{j}-V({s}_{j};{\theta}_{v}^{\prime}))}_{+}$
        Compute gradients w.r.t. ${\theta}_{v}^{\prime}$: $d{\theta}_{v}\leftarrow \partial {\left({({G}_{j}-V({s}_{j};{\theta}_{v}^{\prime}))}_{+}\right)}^{2}/\partial {\theta}_{v}^{\prime}$
        Perform asynchronous update of $\theta $ using $d\theta $ and of ${\theta}_{v}$ using $d{\theta}_{v}$
    end for
    while len(${\mathcal{Q}}_{\mathcal{E}}$) $>0$ do
        Dequeue first episode $\mathcal{E}$ from ${\mathcal{Q}}_{\mathcal{E}}$
        $\mathcal{D}\leftarrow \mathcal{D}\cup \{{S}_{t},{A}_{t},{G}_{t}\}$ for all $t$ in $\mathcal{E}$
    end while
until $T>{T}_{max}$
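The data flow around the SIL-learner (episode buffer, transformed returns, shared queue, replay memory) can be sketched in a single thread as follows. This is a simplified illustration under stated assumptions, not the released implementation: threading and gradient updates are omitted, all function names are hypothetical, the reward stored at each step is treated as the reward received for that step, and `h` and `h_inv` default to the identity so that the transform of Equation (2) can be plugged in by the caller.

```python
import random
from collections import deque


def transformed_returns(rewards, gamma, h=lambda z: z, h_inv=lambda z: z):
    """Compute G_t = h(r + gamma * h_inv(G_next)) backwards over one episode.

    The return after the terminal state is taken to be 0.
    """
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = h(r + gamma * h_inv(g))
        returns.append(g)
    returns.reverse()
    return returns


def drain_episode_queue(episode_queue, replay):
    """Move every queued episode, a list of (s, a, G) tuples, into the replay memory."""
    while len(episode_queue) > 0:
        episode = episode_queue.popleft()
        replay.extend(episode)


def sample_minibatch(replay, batch_size, rng=random):
    """Uniformly sample up to batch_size transitions without replacement."""
    return rng.sample(replay, min(batch_size, len(replay)))
```

In this sketch an actor-learner would append its finished episode to a `collections.deque`, and the SIL-learner would alternate between `sample_minibatch` (the inner loop of Algorithm 3.1) and `drain_episode_queue` (the while loop).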
3.2 Pretraining Methods
We now introduce our pretraining methods. The same set of non-expert human demonstration data collected in de la Cruz et al. (2018) is used for all pretraining (see Table 1). This work deviates from our previous work in two aspects: 1) we integrate multiple losses for pretraining, while the previous work only considered supervised pretraining, and 2) we train using A3CTB+SIL, while the previous work trained using A3CTB; we shall see that integrating SIL is beneficial for our pretraining approach and will discuss the reasons later in this section.
Our previous work showed that supervised pretraining on demonstrations can accelerate learning in A3C and A3CTB de la Cruz et al. (2018); for brevity, we denote this method as [SL] pretraining (we denote all pretraining methods with their names in brackets). However, there were two potential problems with [SL] pretraining: 1) the value output layer (fc3) of A3C is not pretrained, and 2) training only to minimize the loss between human and agent actions is suboptimal, since the actions demonstrated by the human could be noisy.
To tackle the first problem, we pretrain an aggregated loss that consists of the supervised and the value return losses. For the supervised component, we use the SIL policy loss ${L}_{policy}^{sil}$, which can be interpreted as the cross-entropy loss $-\log (\pi ({a}_{t}|{s}_{t};\theta ))$ weighted by ${({G}_{t}-V({s}_{t};{\theta}_{v}))}_{+}$ Oh et al. (2018). The ${(\cdot )}_{+}$ operator encourages the agent to imitate past decisions only when the returns are larger than the value. The value return loss is nearly identical to the value loss ${L}_{value}^{sil}$ in SIL, but without the ${(\cdot )}_{+}$ operator. Note that this is also similar to A3C’s value loss ${L}_{value}^{a3c}$, but instead of the $n$-step bootstrapped value ${Q}^{(n)}$, we use the discounted returns ${G}_{t}$, which can be easily computed from the human demonstration data since it contains the full trajectory of each episode. We denote this method as [SL+V] pretraining.
de la Cruz et al. (2018) also revealed that supervised pretraining learns features that are geared more towards the supervised loss. For example, in Pong, the area around the paddle is an important region, since the paddle movements are associated with human actions. This relates to the second problem mentioned above: if the learned features focus on human actions only, they might not generalize well to new trajectories. To obtain extra information in addition to the supervised features, we take inspiration from the supervised autoencoder framework, which jointly trains a classifier and an autoencoder Le et al. (2018); we believe this approach will retain the important features learned through supervised pretraining and, at the same time, learn additional general features from the added autoencoder loss. Finally, we blend in the value loss ${L}_{v}^{saev}$ with the supervised and autoencoder losses as
$${L}^{saev}={L}_{s}^{saev}({W}_{s}F{x}_{i},{y}_{i})+{L}_{ae}^{saev}({W}_{ae}F{x}_{i},{x}_{i})+{L}_{v}^{saev}({W}_{v}F{x}_{i},{x}_{i}).$$  (6)
We denote this method as [SL+V+AE] pretraining. The network architecture of this pretraining method is shown in Figure 1. In this network, TensorFlow’s SAME padding option is used to ensure that the input size is the same as the output size, which is necessary for reconstructing the input. With the filters of the existing network architecture, this change results in a final network output of $88\times 88$. Thus, instead of downsizing the output from $88\times 88$ to $84\times 84$ (which is the original input size of A3C), we changed the input size to $88\times 88$, in the spirit of the work of Kimura (2018), which uses an autoencoder for pretraining.
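The joint objective of Equation (6) can be illustrated for a single demonstration sample. The sketch below rests on several assumptions that the text does not fix: the reconstruction term is taken to be a mean squared error over pixels, the three terms are summed with equal weight as written in Equation (6), the value term regresses onto the demonstration return ${G}_{t}$ as described for [SL+V], and the name `joint_pretraining_loss` is hypothetical.

```python
import math


def log_softmax_at(logits, index):
    """Numerically stable log-softmax of `logits`, evaluated at one index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - log_z


def joint_pretraining_loss(logits, human_action, value, ret, x, x_hat):
    """Return (total, supervised, reconstruction, value_loss) for one sample.

    logits:       policy-head outputs for the game state.
    human_action: index of the demonstrated action (the label).
    value:        value-head output V(s_t).
    ret:          discounted return G_t from the demonstration episode.
    x, x_hat:     flattened input frame and its reconstruction.
    """
    weight = max(ret - value, 0.0)                    # (G_t - V)_+
    supervised = -log_softmax_at(logits, human_action) * weight
    value_loss = 0.5 * (ret - value) ** 2             # no (.)_+ here
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    total = supervised + reconstruction + value_loss
    return total, supervised, reconstruction, value_loss
```

Note the asymmetry between the first and last terms: a demonstrated action with a return below the current value estimate contributes nothing to the classification term, yet it still trains the value head.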
There are two strong motivations for using self-imitation learning in A3C together with pretraining. First, human demonstration data can be loaded into SIL’s memory to jump-start the memory, so that the agent continues learning from the data. Second, the motivation for jointly pretraining with multiple losses, especially with a value loss, is not only to learn better features but also to transfer the pretrained policy and value layers into the A3C network. Adding the value loss allows pretraining the entire network. In turn, since SIL imitates its own experience, the data generated during the early stages are potentially closely related to the policy and value learned from pretraining, so the learning speed could increase further at the early stage of training. By using SIL, pretraining addresses feature learning while implicitly addressing policy learning.
4 Experiments and Results
We now present our experiments. First, we show that A3CTB+SIL exceeds the performance of A3C and A3CTB. As a comparison, we also evaluate A3C+SIL, since Oh et al. (2018) implemented SIL in the synchronous version of A3C (i.e., A2C) Mnih et al. (2016), which differs from our implementation, in which the SIL-learner is asynchronous. Then, we present our experiments and results for the pretraining approaches (all trained in A3CTB+SIL after pretraining) and show that they all outperform the baseline A3C algorithm. We use the same set of parameters across all experiments, shown in Table 2.
4.1 A3CTB+SIL
Figure 2 shows the performance of A3CTB+SIL when compared to the baseline A3C Mnih et al. (2016), A3CTB de la Cruz et al. (2018), and A3C+SIL. A3C+SIL helps in both games and shows better improvement in MsPacman. This is consistent with the findings in Oh et al. (2018) that imitating past good experiences encourages exploration, which is beneficial for hard exploration games. Our proposed method A3CTB+SIL shows the best performance among all. The largely improved score in MsPacman indicates that it is important for the agent to be able to distinguish big and small rewards (the function of TB); SIL helps to imitate past experiences with large returns.
4.2 A3CTB+SIL with Pretraining
Since A3CTB+SIL has shown the largest improvement without pretraining, from now on we investigate whether our new pretraining approaches can further accelerate A3CTB+SIL. That is, after pretraining, all agents are trained in A3CTB+SIL. Figure 3 shows the results for the pretraining methods. In the game of MsPacman, while [SL] already shows slight improvements over the baseline, both [SL+V] and [SL+V+AE] show much larger improvements than [SL], achieving a testing reward (averaged over three trials) of around 8,000. Compared with state-of-the-art results such as Hester et al. (2018) and Oh et al. (2018), where the final rewards for MsPacman are roughly 5,000 and 4,000 respectively (numbers approximately read from the figures of the mentioned papers), our method achieves a considerably higher score.
In the game of Pong, all pretraining methods exceed the baseline performance, but the improvements are not as large as in MsPacman. One reason could be that Pong is a relatively easy Atari game and the agent is able to find a good policy even when learning from scratch. In addition, note that [SL] actually has the fastest learning speed among the pretraining methods, which is intuitively reasonable. Catching the ball is probably the most important behavior to learn in Pong, and this movement is highly associated with the classification of actions; learning the value function and the feature representations did not seem to add benefits beyond learning just an action classifier.
4.2.1 Ablation Study
We want to see how useful the feature representations learned during the pretraining stage are to a DRL agent. It is known that general features are retained in the hidden layers of a neural network, while the output layer is more task-specific Yosinski et al. (2014). Therefore, in this set of experiments we exclude all fully connected output layers (fc2 and fc3) and only initialize a new A3C network with the pretrained convolutional layers and the fc1 layer, then train it in A3CTB+SIL. This experiment allows us to investigate how important general feature learning is relative to task-specific policy learning for a DRL agent.
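The partial transfer used in this ablation can be sketched with parameter dictionaries. The naming scheme below (`"conv1/W"`, `"fc2/b"`, and so on) and the helper `transfer_without_output_layers` are hypothetical and chosen only for illustration; the released code may organize its variables differently.

```python
def transfer_without_output_layers(pretrained, fresh, excluded=("fc2", "fc3")):
    """Initialize a new network from pretrained weights, skipping output layers.

    pretrained, fresh: dicts mapping names such as "conv1/W" to parameters.
    Parameters whose layer prefix is in `excluded`, or that do not exist in
    the new network (e.g. decoder weights), keep their fresh initialization
    or are dropped, respectively.
    """
    merged = dict(fresh)
    for name, value in pretrained.items():
        layer = name.split("/")[0]
        if layer not in excluded and name in fresh:
            merged[name] = value
    return merged
```

Passing `excluded=()` instead would correspond to transferring all shared layers, which is the setting used in the preceding experiments.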
Figure 4 shows the results of transferring pretrained parameters without the fully connected output layers (“no fc” refers to “no fc2 and no fc3”). Note that [AE] pretraining refers to pretraining with only the reconstruction loss ${L}_{ae}^{saev}$ (see Equation (6)); since the autoencoder model trains fc1 and does not affect fc2 and fc3, we consider it as “transfer without fc2 and fc3” and present its results here instead of in the previous experiments, where all layers are transferred. It is interesting to observe that, in MsPacman, the performance of [SL+V] no fc and [SL+V+AE] no fc dropped compared to transferring all layers, whereas in Pong, not transferring the output layers did not affect the performance as much. This indicates again that, due to the nature of MsPacman, more exploration is needed, and having only a good initial feature representation is not as good as having priors about both the features and the behaviors. However, in games that require less strategy learning and exploration, having good initial features of the environment can already provide a performance boost. For example, in Pong, even when the output layers are not transferred, the information retained in the hidden layers is still highly related to the paddle movement (since the network classifies actions), which could be the most important thing to learn in Pong.
The lower performance of not transferring the output layers for MsPacman also shows the benefits of pretraining both the policy layer and the value layer. As shown in Figure 3 for MsPacman, when using all layers of the pretrained network, it has a higher initial testing reward compared to the baseline. This indicates that the initial policy was better after pretraining both policy and value layers. We believe that this could be a way of identifying when to use the full pretrained network and when to exclude the output layers. Future work should study if the following hypotheses hold true:

1. If the initial performance of the pretrained network is better than the baseline, then one should use the full pretrained network.
2. Otherwise, it might be better to use the pretrained network without the output layers.
5 Related Work
Our work is largely inspired by the literature of transfer for supervised learning Pan et al. (2010), particularly in deep supervised learning where a model is rarely trained from scratch. Instead, parameters are usually initialized from a larger pretrained model (e.g., ImageNet Russakovsky et al. (2015)) and then trained in the target dataset. Yosinski et al. (2014) performed a thorough study on how transferable the learned features at each layer are in a deep convolutional neural network, showing that a pretrained ImageNet classification model is able to capture general features that are transferable to a new classification task. Similarly, unsupervised pretraining a neural network can extract key variations of the input data, which in term helps the supervised finetuning stage that follows to converge faster Erhan et al. (2010); Bengio et al. (2013). In this work, we combine the supervised and unsupervised methods into the supervised autoencoder framework Le et al. (2018) as our pretraining stage. Intuitively, the supervised component could guide the autoencoder to learn features that are more specific to the task.
It has been shown that incorporating useful prior information can benefit the policy learning of an RL agent Schaal (1997). Learning from Demonstrations (LfD) integrates such a prior by leveraging available expert demonstrations; the demonstration can either be generated by another agent or be collected from human demonstrators Argall et al. (2009). Some previous work seeks to learn a good initial policy from the demonstration then later on combine with the agentgenerated data to train in RL Kim et al. (2013); Piot et al. (2014); Chemali and Lazaric (2015). More recent approaches have considered LfD in the context of DRL. Christiano et al. (2017) proposes to learn the reward function from human feedback. Gao et al. (2018) considers the scenario when demonstrations are imperfect and proposes a normalized actorcritic algorithm. Perhaps the closest work to ours is the deep Qlearning from demonstration (DQfD) algorithm Hester et al. (2018). In DQfD, the agent is first pretrained over the human demonstration data using a combined supervised and temporal difference losses. However, during the DQN training stage, the agent continues to jointly minimize the temporal difference loss with a large margin supervised loss when sampling from expert demonstration data. Our work instead uses a supervised autoencoder as pretraining, which explicitly emphasizes the representation learning and our pretraining losses are not carried through in the RL training. DQfD was further improved as ApeX DQfD by Pohlen et al. (2018) where the transformed Bellman operator was applied to reduce the variance of the actionvalue function and is applied to a largescale distributed DQN. We empirically observe that our pretraining approaches obtained a higher score in MsPacman than that of DQfD and are comparable to ApeX DQfD (see Section 4.2). However, we are unable to compare directly as we do not have the computational resources to run such a large scale experiment. 
In addition, note that ApeX DQfD does not have a pretraining stage; DQfD addresses both feature and policy learning while our work only addresses feature learning. Therefore, our method is not directly comparable to the above.
The use of unsupervised auxiliary losses has been explored in both deep supervised learning and DRL. For example, Zhang et al. (2016) use unsupervised reconstruction losses to aid in learning large-scale classification tasks; Jaderberg et al. (2017) add auxiliary control tasks that predict feature changes. Our method of pretraining via a supervised autoencoder can be viewed as leveraging the reconstruction loss as an auxiliary task, which guides the agent to learn desirable features for the given task.
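As a rough illustration of how a reconstruction loss can act as an auxiliary term next to supervised and value objectives, the sketch below combines a weighted classification loss, a reconstruction loss, and an expected return loss into a single scalar for one demonstration sample. It is not the implementation used in this paper: the function names, the per-sample weight `action_weight`, the squared-error forms, and the coefficients `recon_coef` and `value_coef` are all hypothetical choices made for the example.

```python
import math


def cross_entropy(logits, action):
    """Softmax cross-entropy of the demonstrated action (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[action]


def mean_squared_error(prediction, target):
    """Mean squared error between two equally sized sequences."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def joint_pretraining_loss(logits, action, action_weight,
                           frame, reconstruction,
                           value, observed_return,
                           recon_coef=1.0, value_coef=0.5):
    """Sum of the three pretraining terms for a single sample.

    action_weight, recon_coef and value_coef are illustrative assumptions.
    """
    # Supervised term: classify the demonstrated action, scaled per sample.
    classification = action_weight * cross_entropy(logits, action)
    # Unsupervised auxiliary term: reconstruct the input frame.
    reconstruction_loss = mean_squared_error(reconstruction, frame)
    # Value term: regress the predicted value onto the observed return.
    value_loss = (value - observed_return) ** 2
    return (classification
            + recon_coef * reconstruction_loss
            + value_coef * value_loss)


if __name__ == "__main__":
    loss = joint_pretraining_loss(
        logits=[0.0, 0.0, 0.0, 0.0], action=0, action_weight=1.0,
        frame=[0.0, 0.0], reconstruction=[1.0, 1.0],
        value=1.0, observed_return=3.0,
    )
    print(loss)
```

Because all three terms are summed, their gradients are backpropagated through the same network during pretraining, so in a sketch like this the reconstruction term would shape any layers shared with the classification and value outputs.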
6 Discussion and Conclusion
In this work, we studied several pretraining methods and showed that the A3C algorithm can be sped up by pretraining on a small set of non-expert human demonstration data. In particular, we proposed integrating rewards into supervised policy pretraining, which helps to quantify how good a demonstrated action was. The value function and autoencoder components of pretraining yielded the most significant performance improvements and exceeded the state-of-the-art results in the game of MsPacman. Our approach is lightweight and easy to implement on a single machine.
While pretraining works well in the two games presented in this paper, the experiments need to be repeated in more games to show the generality of our method. It would also be interesting to study what features are learned during pretraining. de la Cruz et al. (2018) visualized the features learned in supervised pretraining and in a final DRL model and found that they share some common patterns, which suggests why pretraining is useful. In future work, we are interested in studying how the learned feature patterns differ across pretraining methods. Lastly, pretraining methods only address the problem of feature learning in DRL and do not aid policy learning. To further accelerate learning, we plan to look into how policy learning could be improved using human demonstration data. Some existing work, such as DQfD, integrates the supervised loss not only during pretraining but also while training the DRL agent Hester et al. (2018); other work leverages human demonstrations as advice, constantly providing suggestions during policy learning Wang and Taylor (2017). We leave these as future directions.
Acknowledgements
The A3C implementation was a modification of https://github.com/miyosuda/async_deep_reinforce. The authors thank NVIDIA for donating a graphics card used in these experiments. This research used resources of Kamiak, Washington State University’s high-performance computing cluster, where we ran all our experiments.
References
 Argall et al. (2009) Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. 2009. A survey of robot learning from demonstration. Robotics and autonomous systems 57, 5 (2009), 469–483.
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
 Bojarski et al. (2017) Mariusz Bojarski, Philip Yeres, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Lawrence Jackel, and Urs Muller. 2017. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911 (2017).
 Chemali and Lazaric (2015) Jessica Chemali and Alessandro Lazaric. 2015. Direct policy iteration with demonstrations. In IJCAI.
 Christiano et al. (2017) Paul Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In NIPS.
 de la Cruz et al. (2018) Gabriel V. de la Cruz, Jr., Yunshu Du, and Matthew E Taylor. 2018. Pre-training with Non-expert Human Demonstration for Deep Reinforcement Learning. arXiv preprint arXiv:1812.08904 (2018).
 Du et al. (2018) Yunshu Du, Wojciech M Czarnecki, Siddhant M Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. 2018. Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224 (2018).
 Erhan et al. (2010) Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11, Feb (2010), 625–660.
 Gao et al. (2018) Yang Gao, Ji Lin, Fisher Yu, Sergey Levine, Trevor Darrell, et al. 2018. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313 (2018).
 Hester et al. (2018) Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. 2018. Deep Q-learning from demonstrations. In AAAI.
 Jaderberg et al. (2017) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. 2017. Reinforcement learning with unsupervised auxiliary tasks. In ICLR.
 Kim et al. (2013) Beomjoon Kim, Amirmassoud Farahmand, Joelle Pineau, and Doina Precup. 2013. Learning from limited demonstrations. In Advances in Neural Information Processing Systems. 2859–2867.
 Kimura (2018) Daiki Kimura. 2018. DAQN: Deep auto-encoder and Q-network. arXiv preprint arXiv:1806.00630 (2018).
 Le et al. (2018) Lei Le, Andrew Patterson, and Martha White. 2018. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In Advances in Neural Information Processing Systems. 107–117.
 Li (2017) Yuxi Li. 2017. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274 (2017).
 Lin (1992) Long-Ji Lin. 1992. Reinforcement learning for robots using neural networks. Ph.D. Dissertation.
 Miotto et al. (2017) Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. 2017. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics (2017).
 Mirowski et al. (2016) Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. 2016. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016).
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In ICML. 1928–1937.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
 Oh et al. (2018) Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. 2018. Self-imitation learning. In ICML.
 Pan et al. (2010) Sinno Jialin Pan, Qiang Yang, et al. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359.
 Piot et al. (2014) Bilal Piot, Matthieu Geist, and Olivier Pietquin. 2014. Boosted Bellman residual minimization handling expert demonstrations. In ECML/PKDD.
 Pohlen et al. (2018) Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, et al. 2018. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593 (2018).
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
 Schaal (1997) Stefan Schaal. 1997. Learning from demonstration. In Advances in neural information processing systems. 1040–1046.
 Schmitt et al. (2018) Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. 2018. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835 (2018).
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
 Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (2018), 1140–1144.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
 Taylor and Stone (2009) Matthew E Taylor and Peter Stone. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research 10, Jul (2009), 1633–1685.
 Wang and Taylor (2017) Zhaodong Wang and Matthew E Taylor. 2017. Improving reinforcement learning with confidence-based demonstrations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI).
 Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks?. In Advances in neural information processing systems. 3320–3328.
 Zhang et al. (2016) Yuting Zhang, Kibok Lee, and Honglak Lee. 2016. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In ICML. 612–621.