Reinforcement Learning from Imperfect Demonstrations
Abstract
Robust real-world learning should benefit from both demonstrations and interactions with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on the reward received from the environment. These tasks have divergent losses which are difficult to jointly optimize, and such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment, surpassing the demonstrator’s performance. Crucially, both learning from demonstration and interactive refinement use the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data, since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.
1 Introduction
Deep reinforcement learning (RL) has achieved significant success on many complex sequential decision-making problems. However, RL algorithms usually require a large number of interactions with an environment to reach good performance (Kakade et al., 2003); initial performance may be nearly random, clearly suboptimal, and often rather dangerous in real-world settings such as autonomous driving. Learning from demonstration is a well-known alternative, but typically does not leverage reward, and presumes relatively small-scale noise-free demonstrations. We develop a new robust algorithm that can learn value and policy functions from state, action and reward ($s,a,r$) signals that come either from imperfect demonstration data or from the environment.
Recent efforts toward policy learning that does not suffer from suboptimal initial performance generally leverage an initial phase of supervised learning and/or auxiliary task learning. Several previous efforts have shown that demonstrations can speed up RL by mimicking expert data with a temporal difference regularizer (Hester et al., 2017) or via gradient-free optimization (Ebrahimi et al., 2017), yet these methods presume near-optimal demonstrations. Jaderberg et al. (2016) and Shelhamer et al. (2016) obtained better initialization via auxiliary task losses (e.g., predicting environment dynamics) in a self-supervised manner; policy performance is still initially random with these approaches.
A simple combination of several distinct losses can learn from demonstrations; however, it is more appealing to have a single principled loss that is applicable to learning both from the demonstration and from the environment. Our approach, Normalized Actor-Critic (NAC), uses a unified loss function to process both offline demonstration data and online experience based on the underlying maximum entropy reinforcement learning framework (Toussaint, 2009; Haarnoja et al., 2017b; Schulman et al., 2017). Our approach enables robust learning from corrupted (or even partially adversarial) demonstrations that contain ($s,a,r$), because no assumption on the optimality of the data is required. A normalized formulation of the soft Q-learning gradient enables the NAC method, which can also be regarded as a variant of the policy gradient.
We evaluate our approach in a toy Minecraft game, as well as two realistic 3D simulated environments, Torcs and Grand Theft Auto V (GTA V), with either discrete states and tabular Q functions, or raw image input and Q functions approximated by neural networks. Our method outperforms previous approaches on driving tasks with only a modest amount of demonstrations while tolerating significant noise in the demonstrations, as it utilizes rewards rather than simply imitating demonstrated behaviors.
We summarize the contributions in this paper as follows.

• We propose the NAC method for learning from demonstrations and discover its practical advantages on a variety of environments.
• To the best of our knowledge, we are the first to propose a unified objective, capable of learning from both demonstrations and environments, that outperforms methods including the ones with an explicit supervised imitation loss.
• Unlike other methods that utilize supervised learning to learn from demonstrations, our pure reinforcement learning method is robust to noisy demonstrations.
2 Preliminaries
In this section, we briefly review the reinforcement learning techniques that our method is built on, including maximum entropy reinforcement learning and soft Q-learning.
2.1 Maximum Entropy Reinforcement Learning
The reinforcement learning problem we consider is defined by a Markov decision process (MDP) (Thie, 1983). Specifically, the MDP is characterized by a tuple $\langle \mathbb{S},\mathbb{A},R,T,\gamma \rangle$, where $\mathbb{S}$ is the set of states, $\mathbb{A}$ is the set of actions, $R(s,a)$ is the reward function, $T(s,a,{s}^{\prime})=P({s}^{\prime}\mid s,a)$ is the transition function and $\gamma $ is the reward discount factor. An agent interacts with the environment by taking an action at a given state, receiving the reward, and transitioning to the next state.
In the standard reinforcement learning setting (Sutton & Barto, 1998), the goal of an agent is to learn a policy ${\pi}_{std}$, such that an agent maximizes the future discounted reward:
$\begin{array}{c}\hfill {\pi}_{std}=\underset{\pi}{argmax}{\displaystyle \sum _{t}}{\gamma}^{t}\underset{{s}_{t},{a}_{t}\sim \pi}{\mathbb{E}}[{R}_{t}]\hfill \end{array}$  (1) 
Maximum entropy policy learning (Ziebart, 2010; Haarnoja et al., 2017b) uses an entropy augmented reward. The optimal policy will not only optimize for discounted future rewards, but also maximize the discounted future entropy of the action distribution:
$\begin{array}{c}\hfill {\pi}_{ent}=\underset{\pi}{argmax}{\displaystyle \sum _{t}}{\gamma}^{t}\underset{{s}_{t},{a}_{t}\sim \pi}{\mathbb{E}}[{R}_{t}+\alpha H(\pi (\cdot \mid {s}_{t}))]\hfill \end{array}$  (2) 
where $\alpha $ is a weighting term to balance the importance of the entropy. Unlike previous attempts that only add the entropy term at a single time step, maximum entropy policy learning maximizes the discounted future entropy over the whole trajectory. Maximum entropy reinforcement learning has many benefits, such as better exploration in multimodal problems and connections between Q-learning and the actor-critic method (Haarnoja et al., 2017b; Schulman et al., 2017).
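As a rough illustration of the entropy-augmented objective in Equation (2), the sketch below computes the discounted return of a single sampled trajectory. The function names `entropy` and `max_ent_return`, the default values of `alpha` and `gamma`, and the representation of each policy output as a list of action probabilities are assumptions made for this example only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete action distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # skip zero-probability actions (0 * log 0 = 0)
    return float(-np.sum(p[nz] * np.log(p[nz])))

def max_ent_return(rewards, action_dists, alpha=0.1, gamma=0.99):
    """Sample estimate of sum_t gamma^t (R_t + alpha * H(pi(.|s_t)))."""
    total = 0.0
    for t, (r, p) in enumerate(zip(rewards, action_dists)):
        total += gamma ** t * (r + alpha * entropy(p))
    return total

# Same rewards, deterministic versus uniform action distributions.
greedy = max_ent_return([1.0, 1.0], [[1.0, 0.0], [1.0, 0.0]], alpha=0.1, gamma=0.5)
uniform = max_ent_return([1.0, 1.0], [[0.5, 0.5], [0.5, 0.5]], alpha=0.1, gamma=0.5)
```

With identical rewards, the trajectory generated by the more random policy receives the larger augmented return, which is the sense in which the objective rewards entropy at every step rather than at a single one.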
2.2 Soft Value Functions
Since the maximum entropy RL paradigm augments the reward with an entropy term, the definition of the value functions naturally changes to
${Q}_{\pi}(s,a)$  $={R}_{0}+\underset{({s}_{t},{a}_{t})\sim \pi}{\mathbb{E}}{\displaystyle \sum _{t=1}^{\infty}}{\gamma}^{t}({R}_{t}+\alpha H(\pi (\cdot \mid {s}_{t})))$  (3)
${V}_{\pi}(s)$  $=\underset{({s}_{t},{a}_{t})\sim \pi}{\mathbb{E}}{\displaystyle \sum _{t=0}^{\infty}}{\gamma}^{t}({R}_{t}+\alpha H(\pi (\cdot \mid {s}_{t})))$  (4)
where $\pi $ is the policy on which the value functions are evaluated. Given the state-action value function ${Q}^{*}(s,a)$ of the optimal policy, Ziebart (2010) shows that the optimal state value function and the optimal policy can be expressed as:
${V}^{*}(s)$  $=\alpha \mathrm{log}{\displaystyle \sum _{a}}\mathrm{exp}({Q}^{*}(s,a)/\alpha )$  (5)
${\pi}^{*}(a\mid s)$  $=\mathrm{exp}(({Q}^{*}(s,a)-{V}^{*}(s))/\alpha )$  (6)
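The log-sum-exp form of the soft value and the induced softmax policy can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a discrete action set and a Q-array whose last axis indexes actions; the helper names `soft_value` and `soft_policy` are hypothetical.

```python
import numpy as np

def soft_value(q, alpha=1.0):
    """V(s) = alpha * log sum_a exp(Q(s,a) / alpha), computed stably."""
    z = np.asarray(q, dtype=float) / alpha
    m = z.max(axis=-1, keepdims=True)          # subtract the max to avoid overflow
    return alpha * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def soft_policy(q, alpha=1.0):
    """pi(a|s) = exp((Q(s,a) - V(s)) / alpha), a softmax over actions."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(soft_value(q, alpha))
    return np.exp((q - v[..., None]) / alpha)

q = np.array([1.0, 2.0, 0.5])
v = soft_value(q, alpha=0.5)
pi = soft_policy(q, alpha=0.5)
```

Because the policy is defined through $Q - V$, it is normalized by construction, and the soft value is never smaller than the largest Q-value; as $\alpha$ shrinks it approaches the hard maximum.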
2.3 Soft Q-Learning and Policy Gradient
With the entropy augmented reward, one can derive the soft versions of Q-learning (Haarnoja et al., 2017a) and policy gradient. The soft Q-learning gradient is given by
${\nabla}_{\theta}{Q}_{\theta}(s,a)({Q}_{\theta}(s,a)-\widehat{Q}(s,a))$  (7) 
where $\widehat{Q}(s,a)$ is a bootstrapped Q-value estimate obtained by $R(s,a)+\gamma {V}_{Q}({s}^{\prime})$. Here, $R(s,a)$ is the reward received from the environment, and ${V}_{Q}$ is computed from ${Q}_{\theta}(s,a)$ with Equation (5). We can also derive a policy gradient, which includes a gradient of the form:
$\mathbb{E}[{\displaystyle \sum _{t=0}^{\infty}}{\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}({\widehat{Q}}_{\pi}-b({s}_{t}))+\alpha {\nabla}_{\theta}H({\pi}_{\theta}(\cdot \mid {s}_{t}))]$  (8) 
where $b({s}_{t})$ is some arbitrary baseline (Schulman et al., 2017).
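For a tabular $Q$, the gradient of $Q(s,a)$ with respect to the table is one-hot, so a descent step along the soft Q-learning gradient touches only the sampled entry. The sketch below assumes that reading of Equation (7), with the difference taken as $Q - \widehat{Q}$; the function name `soft_q_update`, the `done` flag, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def soft_value(q_row, alpha):
    """alpha * log sum_a exp(Q(s,a) / alpha) for one state (Equation (5))."""
    z = q_row / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

def soft_q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99, lr=0.5):
    """One tabular descent step along the soft Q-learning gradient, in place."""
    v_next = 0.0 if done else soft_value(Q[s_next], alpha)
    q_hat = r + gamma * v_next        # bootstrapped target, treated as a constant
    td_error = Q[s, a] - q_hat
    Q[s, a] -= lr * td_error          # only the sampled (s, a) entry changes
    return td_error

Q = np.zeros((3, 2))                  # 3 states, 2 actions (toy sizes)
td = soft_q_update(Q, s=0, a=1, r=1.0, s_next=1, done=False)
```

After this step `Q[0, 1]` has moved toward the target while `Q[0, 0]`, the action that was not sampled, is left exactly where it was.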
3 Robust Learning from Demonstration and Reward
Given a set of demonstrations that contains ($s,a,r,{s}^{\prime}$) and the corresponding environment, an agent should perform appropriate actions when it starts the interaction, and continue to improve. Although a number of off-policy RL algorithms could in principle be used to learn directly from off-policy demonstration data, standard methods can suffer from extremely poor performance when trained entirely on demonstration data. This can happen when the demonstration set is a strongly biased sample of the environment transitions, which violates the assumptions of many off-policy RL methods. Although they are closely related, off-policy learning and learning from demonstrations are different problems. In Section 5, we show that Q-learning completely fails on the demonstration data. The intuition behind this problem is that if the Q-function is trained only on good data, it has no way to understand why the action taken is appropriate: it will assign a high Q-value, but will not necessarily assign a low Q-value to other alternative actions.
The framework of soft optimality provides us with a natural mechanism to mitigate this problem by normalizing the Q-function over the actions. Our approach, Normalized Actor-Critic (NAC), utilizes soft policy gradient formulations described in Section 2.3 to obtain a Q-function gradient that reduces the Q-values of actions that were not observed along the demonstrations. In other words, without data to indicate otherwise, NAC will opt to follow the demonstrations. This method is a well-defined RL algorithm without any auxiliary supervised loss and hence, it is able to learn without bias in the face of low-quality demonstration data. We will first describe our algorithm, and then discuss why it performs well when trained on the demonstrations.
3.1 Normalized Actor-Critic for Learning from Demonstration
We propose a unified learning from demonstration approach, which applies the normalized actor-critic updates to both off-policy demonstrations and in-environment transitions. The NAC method is derived from the soft policy gradient objective with a $Q$ function parametrization. Specifically, we take gradient steps to maximize the future reward objective (Equation (2)), and parametrize $\pi $ and $V$ in terms of $Q$ (Equations (6) and (5)). As derived in the appendix, the updates for the actor and critic are:
${\nabla}_{\theta}{J}_{PG}=\underset{s,a\sim {\pi}_{Q}}{\mathbb{E}}\left[({\nabla}_{\theta}Q(s,a)-{\nabla}_{\theta}{V}_{Q}(s))(Q(s,a)-\widehat{Q})\right]$  (9)
${\nabla}_{\theta}{J}_{V}=\underset{s}{\mathbb{E}}\left[{\nabla}_{\theta}{\displaystyle \frac{1}{2}}{({V}_{Q}(s)-\widehat{V}(s))}^{2}\right]$  (10) 
where ${V}_{Q}$ and ${\pi}_{Q}$ are deterministic functions of $Q$:
${V}_{Q}(s)$  $=\alpha \mathrm{log}{\displaystyle \sum _{a}}\mathrm{exp}(Q(s,a)/\alpha )$  (11)  
${\pi}_{Q}(a\mid s)$  $=\mathrm{exp}((Q(s,a)-{V}_{Q}(s))/\alpha )$  (12) 
$\widehat{Q}(s,a),\widehat{V}(s)$ are obtained by:
$\widehat{Q}(s,a)=R(s,a)+\gamma {V}_{Q}({s}^{\prime})$  (13)  
$\widehat{V}(s)=\underset{a\sim {\pi}_{Q}}{\mathbb{E}}[R(s,a)+\gamma {V}_{Q}({s}^{\prime})]+\alpha H({\pi}_{Q}(\cdot \mid s))$  (14) 
Comparing NAC’s actor update (Equation (9)) with the soft Q update (Equation (7)), the difference is the ${\nabla}_{\theta}{V}_{Q}(s)$ term. We emphasize the normalization effect of this term: it avoids pushing up the $Q$-values of actions that are not demonstrated. The mechanism is explained in Section 3.2.
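To make the actor and critic terms concrete, here is a minimal tabular sketch of one NAC step on a single transition. It rests on several assumptions that are not spelled out in the text: Equations (9) and (10) are read with the differences $Q-\widehat{Q}$ and $V_Q-\widehat{V}$ and are descended like loss gradients (as with Equation (7)); the expectation in $\widehat{V}(s)$ is replaced by the one sampled transition; and the targets come from a separate table `Q_target`. The name `nac_update`, the `done` flag, and the hyperparameters are hypothetical.

```python
import numpy as np

def soft_value(q_row, alpha):
    """V_Q(s) = alpha * log sum_a exp(Q(s,a) / alpha) (Equation (11))."""
    z = q_row / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

def nac_update(Q, Q_target, s, a, r, s_next, done,
               alpha=0.1, gamma=0.99, lr=0.5):
    """One tabular NAC step on a single transition; updates Q[s] in place."""
    v = soft_value(Q[s], alpha)
    pi = np.exp((Q[s] - v) / alpha)                   # Equation (12)
    entropy = -np.sum(pi * np.log(pi + 1e-12))
    v_next = 0.0 if done else soft_value(Q_target[s_next], alpha)

    q_hat = r + gamma * v_next                        # Equation (13)
    v_hat = q_hat + alpha * entropy                   # one-sample form of Equation (14)

    one_hot = np.zeros_like(pi)
    one_hot[a] = 1.0
    # For a table, grad Q(s,a) is one-hot and grad V_Q(s) equals pi_Q(.|s).
    actor_grad = (one_hot - pi) * (Q[s, a] - q_hat)   # Equation (9)
    critic_grad = pi * (v - v_hat)                    # Equation (10)
    Q[s] -= lr * (actor_grad + critic_grad)
    return actor_grad, critic_grad

Q = np.zeros((3, 2))                                  # 3 states, 2 actions (toy sizes)
Q_target = Q.copy()
actor, critic = nac_update(Q, Q_target, s=0, a=1, r=1.0, s_next=1, done=False)
```

In this toy call the actor term sums to zero over actions: it raises the entry for the demonstrated action and lowers the other entry by the same amount, whereas the critic term moves the whole row so that $V_Q(s)$ tracks $\widehat{V}(s)$.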
The expectations in Eq. (9) and Eq. (10) are taken with respect to ${\pi}_{Q}$. In the demonstration set, we only have transition samples from the behavioral policy $\mu (a\mid s)$. To have a proper policy gradient algorithm, we can employ importance sampling to correct the mismatch. To be specific, when estimating ${\mathbb{E}}_{(s,a)\sim {\pi}_{Q}}\left[f(s,a)\right]$, we estimate ${\mathbb{E}}_{(s,a)\sim \mu}\left[f(s,a)\beta \right]$, where $\beta =\mathrm{min}\{\frac{{\pi}_{Q}(a\mid s)}{\mu (a\mid s)},c\}$ and $c$ is some constant that prevents the importance ratio from being too large. Although the importance weights are needed to formalize our method as a proper policy gradient algorithm, we find in our empirical evaluation that the inclusion of these weights consistently reduces the performance of our method. We found that omitting the weights results in better final performance even when training entirely on demonstration data. For this reason, our final algorithm does not use importance sampling.
We summarize the proposed method in Algorithm 1. Our method uses samples from the demonstrations and the replay buffer, rather than restricting the samples to be on-policy as in standard actor-critic methods. Similar to DQN, we utilize a target network to compute $\widehat{Q}(s,a)$ and $\widehat{V}(s)$, which stabilizes the training process.
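The clipped importance weight $\beta$ can be sketched as follows. This is an illustration of the weighting described above and not part of the final algorithm, which omits the weights; the function names and the clipping constant `c=10.0` are assumptions, since no value of $c$ is given in the text.

```python
import numpy as np

def clipped_importance_weight(pi_q, mu, c=10.0):
    """beta = min(pi_Q(a|s) / mu(a|s), c), elementwise over a batch."""
    pi_q = np.asarray(pi_q, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return np.minimum(pi_q / mu, c)

def weighted_estimate(f_values, pi_q, mu, c=10.0):
    """Estimate E_{pi_Q}[f] from samples that were drawn under mu."""
    beta = clipped_importance_weight(pi_q, mu, c)
    return float(np.mean(np.asarray(f_values, dtype=float) * beta))

# Second sample: the demonstrator rarely took an action the learner now prefers.
beta = clipped_importance_weight([0.5, 0.9], [0.5, 0.01], c=10.0)
estimate = weighted_estimate([2.0, 1.0], [0.5, 0.9], [0.5, 0.01], c=10.0)
```

The raw ratio for the second sample would be 90; clipping caps it at `c`, trading some bias for lower variance.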
3.2 Analysis of the Method
We provide an intuitive analysis in this section to explain why our method can learn from demonstrations while other reinforcement learning methods, such as Q-learning, cannot. The states and actions in the demonstrations generally have higher Q-values than other states. Q-learning will push up $Q(s,a)$ values in a sampled state $s$. However, if the values for the bad actions are not observed, the Q-function has no way of knowing whether the action itself is good, or whether all actions in that state are good, so the demonstrated action will not necessarily have a higher Q-value than other actions in the demonstrated state.
Comparing the actor update (Eq. (9)) of our method with the soft Q-learning (Eq. (7)) update, our method includes an extra term in the gradient: ${\nabla}_{\theta}{V}_{Q}(s)$. This term falls out naturally when we derive the update from a policy gradient algorithm, rather than a Q-learning algorithm. Intuitively, this term will decrease ${V}_{Q}(s)$ when increasing $Q(s,a)$ and vice versa, since ${\nabla}_{\theta}Q(s,a)$ and ${\nabla}_{\theta}{V}_{Q}(s)$ have different signs. Because ${V}_{Q}(s)=\alpha \mathrm{log}{\sum}_{a}\mathrm{exp}(Q(s,a)/\alpha )$, decreasing ${V}_{Q}(s)$ will prevent $Q(s,a)$ from increasing for the actions that are not in the demonstrations. That is why the aforementioned normalization effect emerges with the extra ${\nabla}_{\theta}{V}_{Q}(s)$ term.
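For the special case of a tabular $Q$ (an illustrative assumption; the experiments also use neural networks), the normalization effect can be made explicit by differentiating with respect to the entries $Q(s,a')$ of the demonstrated state, using the log-sum-exp definition of $V_Q$ and reading the actor direction as $\nabla Q(s,a) - \nabla V_Q(s)$:

```latex
\begin{aligned}
\frac{\partial V_Q(s)}{\partial Q(s,a')}
  &= \frac{\exp(Q(s,a')/\alpha)}{\sum_{b}\exp(Q(s,b)/\alpha)}
   = \pi_Q(a' \mid s), \\
\frac{\partial}{\partial Q(s,a')}\bigl[\,Q(s,a) - V_Q(s)\,\bigr]
  &= \mathbb{1}[a' = a] - \pi_Q(a' \mid s), \\
\sum_{a'} \bigl(\mathbb{1}[a' = a] - \pi_Q(a' \mid s)\bigr)
  &= 0 .
\end{aligned}
```

Since the components sum to zero, any step that raises $Q(s,a)$ for the demonstrated action lowers the entries of the other actions in proportion to $\pi_Q(a' \mid s)$, whereas the soft Q-learning direction has only the indicator component and leaves the other actions untouched.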
Besides having the normalizing behavior, NAC is also less sensitive to noisy demonstrations. Since NAC is an RL algorithm, it is naturally resistant to poor behaviors. One can also see this with a similar analysis as above. When there is a negative reward in the demonstrations, $Q(s,a)$ tends to decrease and ${V}_{Q}(s)$ tends to increase, hence the normalizing behavior acts in the reverse direction.
Moreover, NAC provides a single and principled approach to learn from both demonstrations and environments. It avoids the use of imitation learning. Therefore, besides its natural robustness to imperfect demonstrations, it is a more unified approach compared with other methods.
4 Related Work
4.1 Maximum Entropy Reinforcement Learning
Maximum entropy reinforcement learning has been explored in a number of prior works (Todorov, 2008; Toussaint, 2009; Ziebart et al., 2008), including several recent works that extend it into a deep reinforcement learning setting (Nachum et al., 2017; Haarnoja et al., 2017b; Schulman et al., 2017). However, most of those works do not deal with the learning from demonstration setting. Haarnoja et al. (2017b) and Schulman et al. (2017) propose maximum entropy RL methods to learn from environments. PCL (Nachum et al., 2017) is the only prior work that studies the learning from demonstration task with the maximum entropy RL framework. Unlike their method, where the loss is derived from an objective similar to Bellman error, our method is derived from policy gradient. Instead of minimizing Bellman errors, policy gradient directly optimizes future accumulated reward. As shown in Section 5, our method has a large performance advantage over PCL, due to the different objective function.
Our method not only admits a unified objective on both demonstrations and environments but also performs better than alternative methods, such as PCL (Nachum et al., 2017) and DQfD (Hester et al., 2017). To the best of our knowledge, our proposed method is the first unified method across demonstrations and environments that outperforms methods including the ones with explicit supervised imitation loss such as DQfD.
4.2 Learning from Demonstration
Most of the prior learning from demonstration efforts assume the demonstrations are perfect, i.e. the ultimate goal is to copy the behaviors from the demonstrations. Imitation learning is one such approach; examples include Xu et al. (2016) and Bojarski et al. (2016). Extensions such as DAGGER (Ross et al., 2011) are proposed to have the expert in the loop, which further improves the agent’s performance. Recently, Ho & Ermon (2016), Wang et al. (2017), and Ziebart et al. (2008) explore an adversarial paradigm for the behavior cloning method. Another popular paradigm is Inverse Reinforcement Learning (IRL) (Ng et al., 2000; Abbeel & Ng, 2004; Ziebart et al., 2008). IRL learns a reward model which explains the demonstrations as optimal behavior.
Instead of assuming that the demonstrations are perfect, our pure RL method allows imperfect demonstrations. Our method learns which part of the demonstrations is good and which part is bad, unlike the methods that simply imitate the demonstrated behaviors. We follow the Reinforcement Learning with Expert Demonstrations (RLED) framework (Chemali & Lazaric, 2015; Kim et al., 2013; Piot et al., 2014), where both rewards and actions are available in the demonstrations. The extra reward in the demonstrations allows our method to be aware of poor behaviors in the demonstrations. DQfD (Hester et al., 2017) is a recent method that also uses rewards in the demonstrations. It combines an imitation hinge loss with the Q-learning loss in order to learn from demonstrations and transfer to environments smoothly. Due to the use of the imitation loss, DQfD is more sensitive to noisy demonstrations, as we show in the experiment section.
4.3 Off-policy Learning
It is tempting to apply various off-policy methods to the problem of learning from demonstration, such as policy gradient variants (Gu et al., 2017, 2016; Degris et al., 2012; Wang et al., 2016), Q-learning (Watkins & Dayan, 1992) and Retrace (Munos et al., 2016). However, we emphasize that off-policy learning and learning from demonstration are different problems. For most off-policy methods, convergence relies on the assumption of visiting each $(s,a)$ pair infinitely many times. In the learning from demonstration setting, the samples are highly biased and off-policy methods can fail to learn anything from the demonstrations, as we explained for the Q-learning case in Section 3.
5 Results
Our experiments address several questions: (1) Can NAC benefit from both demonstrations and rewards? (2) Is NAC robust to ill-behaved demonstrations? (3) Can NAC learn meaningful behaviors with a limited amount of demonstrations? We compare our algorithm with DQfD (Hester et al., 2017), which has been shown to learn efficiently from demonstrations and to preserve performance while acting in an environment. Other baselines include supervised behavioral cloning, Q-learning, soft Q-learning, the version of our method with importance sampling weighting, PCL, and Q-learning and soft Q-learning trained without demonstrations.
5.1 Environments
We evaluate our method in a grid world, the toy Minecraft (Fig. 2), as well as two realistic 3D simulated environments, Torcs and Grand Theft Auto V (GTA V), shown in Figure 1.
Toy Minecraft: The toy Minecraft is a customized grid world environment. As shown in Figure 2, the agent starts from the left and would like to reach the final goal (marked as a heart). The agent can walk on the green grass, while going into the blue water ends the episode. The input to the agent is its current $(x,y)$ location. At each step, the agent can move Up, Down, Left or Right. It gets a reward of $1$ when reaching the goal, $0$ otherwise. For more details, please refer to the OpenAI gym FrozenLake environment (Brockman et al., 2016).
Torcs: Torcs is an open-source racing game that has been used widely as an experimental environment for driving. The goal of the agent is to drive as fast as possible on the track while avoiding crashes. We use an oval two-lane racing venue in our experiments. The input to the agent is an 84$\times $84 gray scale image. The agent controls the vehicle at 5Hz, and at each step, it chooses from a set of 9 actions which is a Cartesian product between {left, no-op, right} and {up, no-op, down}. We design a dense driving reward function that encourages the car to follow the lane and to avoid collision with obstacles.^{1}
^{1} $reward=(1-{\mathbb{1}}_{damage})[(\mathrm{cos}\theta -\mathrm{sin}\theta -lane\mathrm{\_}ratio)\times speed]+{\mathbb{1}}_{damage}[-10]$, where ${\mathbb{1}}_{damage}$ is an indicator function of whether the vehicle is damaged at the current state, $lane\mathrm{\_}ratio$ is the ratio between distance to lane center and lane width, and $\theta $ is the angle between the vehicle heading direction and the road direction.
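A direct transcription of the driving reward might look like the sketch below. It assumes the footnote's formula reads $(1-\mathbb{1}_{damage})[(\cos\theta-\sin\theta-lane\_ratio)\times speed]+\mathbb{1}_{damage}[-10]$, with the minus signs reconstructed from the description of the reward; the function name, argument names, and the use of radians are assumptions for illustration.

```python
import math

def driving_reward(theta, lane_ratio, speed, damaged, crash_penalty=-10.0):
    """Dense driving reward (assumed form of the footnote formula).

    theta:      angle between vehicle heading and road direction, in radians
    lane_ratio: distance to lane center divided by lane width
    speed:      current vehicle speed
    damaged:    True if the vehicle is damaged at the current state
    """
    if damaged:
        return crash_penalty
    # Reward speed along the road; penalize sideways motion and lane offset.
    return (math.cos(theta) - math.sin(theta) - lane_ratio) * speed

centered = driving_reward(theta=0.0, lane_ratio=0.0, speed=20.0, damaged=False)
off_center = driving_reward(theta=0.0, lane_ratio=0.25, speed=20.0, damaged=False)
crashed = driving_reward(theta=0.0, lane_ratio=0.0, speed=20.0, damaged=True)
```

Under this reading, a car aligned with the road and centered in its lane earns its full speed as reward, drifting from the lane center scales the reward down, and any damage replaces the speed term with the fixed penalty.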
GTA: Grand Theft Auto is an action-adventure video game with goals similar in part to the Torcs game, but with a more diverse and realistic surrounding environment, including the presence of other vehicles, buildings, and bridges. The agent observes 84$\times $84 RGB images from the environment. It chooses from 7 possible actions from {left-up, up, right-up, left, no-op, right, down} at 6Hz. We use the same reward function as in Torcs.
5.2 Comparisons
We compare our approach with the following methods:

• DQfD: the method proposed by Hester et al. (2017). For the learning-from-demonstration phase, DQfD combines a hinge loss with a temporal difference (TD) loss. For the fine-tuning-in-environment phase, DQfD combines a hinge loss on demonstrations and a TD loss on both the demonstrations and the policy-generated data. To alleviate overfitting issues, we also include weight decay following the original paper.
• Q-learning: the classic DQN method (Mnih et al., 2015). We first train DQN with the demonstrations in a replay buffer and then fine-tune in the environment with regular Q-learning. Similar to DQfD, we use a constant exploration ratio of 0.01 in the fine-tuning phase to preserve the performance obtained from the demonstrations. We also train a baseline DQN from scratch in the environment, without any demonstration.
• Behavior cloning with Q-learning: the naive way of combining cross-entropy loss with Q-learning. First we perform behavior cloning with cross-entropy loss on the demonstrations. Then we treat the logit activations prior to the softmax layer as an initialization of the Q function and fine-tune with regular Q-learning in the environment.
• Normalized actor-critic with importance sampling: the NAC method with the importance sampling weighting term mentioned in Sec 3.1. The importance weighting term is used to correct the action distribution mismatch between the demonstration and the current policy.
• Path Consistency Learning (PCL): the PCL (Nachum et al., 2017) method that minimizes the soft path consistency loss. The method proposed in the original paper (denoted as PCLR) does not utilize a target network. We find that PCLR does not work when it is trained from scratch in the visually complex environment. We stabilize it by adding a target network (denoted as PCL), similar to Haarnoja et al. (2017a).
5.3 Experiments on Toy Minecraft
To understand the basic properties of our proposed method, we design the toy Minecraft environment. In this experiment, the state is simply the location of the agent. We use a tabular $Q$ function. With those settings, we hope to reveal some differences between our NAC algorithm and algorithms that incorporate supervised loss.
As shown in Figure 2, there are only two paths to reach the goal. In terms of the discounted reward, the shorter path is more favorable. To make the problem more interesting, we provide the longer suboptimal path as the demonstrations. We found that in the learning from demonstration phase, both DQfD and NAC learn the suboptimal path, since neither method has access to the environment and so cannot discover the optimal path. When the two methods fine-tune their policies in the environment, NAC succeeds in finding the optimal path, while DQfD remains stuck with the suboptimal one. This is because the imitation loss in DQfD prevents it from deviating from the original solution.
5.4 Comparison to Other Methods
We compare our NAC method with other methods on 300k transitions. The demonstrations are collected by a trained Q-learning expert policy. We execute the policy in the environment to collect demonstrations. To avoid deterministic executions of the expert policy, we sample an action randomly with probability 0.01.
To explicitly compare different methods, we show separate figures for performances on the demonstrations and inside the environments. In Fig. 3, we show that our method performs better than other methods on demonstrations. When we start fine-tuning, the performance of our method continues to increase and reaches peak performance faster than other methods. DQfD (Hester et al., 2017) has similar behavior to ours but has lower performance. Behavior cloning learns well on demonstrations, but it has a significant performance drop while interacting with environments. All the methods can ultimately learn by interacting with the environment, but only our method and DQfD start from a relatively high performance. Empirically, we found that the importance weighted NAC method does not perform as well as NAC. A possible reason is that the reduction in gradient bias does not sufficiently compensate for the increase in gradient variance. Without the demonstration data, Q-learning (Q w/o demo) and soft Q-learning (softQ w/o demo) suffer from low performance during the initial interactions with the environment. The original PCLR method (PCLR w/o demo) fails to learn even when trained from scratch in the environments. The improved PCL method (PCL) is not able to learn on the demonstrations, but it can learn in the environment.
We also test our method on the challenging GTA environment, where both the visual input and the game logic are more complex. Due to the limit of the environment execution speed, we only compare our method with DQfD. As shown in Fig. 3(a), our method outperforms DQfD both on the demonstrations and inside the environment.
5.5 Learning from Human Demonstrations
For many practical problems, such as autonomous driving, we might have a large number of human demonstrations, but no demonstration available from a trained agent at all. In contrast to a scripted agent, humans usually perform actions diversely, both across multiple individuals (e.g. conservative players will slow down before a U-turn; aggressive players will not) and within a single individual (e.g. a player may randomly turn or go straight at an intersection). Many learning from demonstration methods, such as Ho & Ermon (2016), do not study this challenging case. We study how different methods perform with diverse demonstrations. To collect human demonstrations, we asked 3 non-expert human players to play TORCS for 3 hours each. Human players control the game with the combination of four arrow keys, at 5Hz, the same rate as the trained agent. In total, we collected around 150k transitions. Among them, 4.5k transitions are used as a validation set to monitor the Bellman error. Compared with data collected from a trained agent, the data is more diverse, and the quality of the demonstrations improves naturally as the players get familiar with the game.
In Fig. 3(b), we observe that the behavior cloning method performs much worse than NAC and DQfD. DQfD is initially better than our method but is quickly surpassed by NAC, which might be caused by the supervised hinge loss being harmful when demonstrations are suboptimal. Similar to the policy-generated demonstrations case, PCL, hard Q-learning and soft Q-learning do not perform well.
5.6 Effects of Imperfect Demonstrations
In the real world, collected demonstrations might be far from optimal. The human demonstrations above have already shown that imperfect demonstrations could have a large effect on performance. To study this phenomenon in a principled manner, we collect a few versions of demonstrations with varying degrees of noise. When collecting the demonstrations with the trained Q agent, we corrupt a certain percentage of the demonstrations by choosing non-optimal actions (${argmin}_{a}Q(s,a)$). The data corruption process is conducted while interacting with the environment; therefore, the error will affect the collection of the following steps. We obtain three demonstration sets with 30%, 50%, and 80% imperfect data. In the left of Fig. 5, we show that our method performs well compared with DQfD and behavior cloning methods. The supervised behavior cloning method is heavily influenced by the imperfect demonstrations. DQfD is also heavily affected, but not as severely as behavior cloning. NAC is robust because it does not imitate the suboptimal behaviors. The results for 50% and 80% imperfect data are similar, and they are available in the appendix.
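The corruption procedure can be sketched as a per-step action choice made while the demonstrator interacts with the environment. This is a simplified illustration under stated assumptions: the expert is taken to act greedily when not corrupted, each step is corrupted independently with probability `corrupt_prob`, and the function name `demo_action` is hypothetical.

```python
import numpy as np

def demo_action(q_values, corrupt_prob, rng):
    """Return (action, corrupted) for one demonstration step."""
    q_values = np.asarray(q_values)
    if rng.random() < corrupt_prob:
        # Corrupted step: take the worst action according to the expert's Q.
        return int(np.argmin(q_values)), True
    # Clean step: follow the expert's greedy action.
    return int(np.argmax(q_values)), False

rng = np.random.default_rng(0)
q = [0.2, 1.5, -0.3]                     # toy Q-values for one state
samples = [demo_action(q, 0.3, rng) for _ in range(10_000)]
corrupt_rate = sum(flag for _, flag in samples) / len(samples)
```

Because the chosen action is the one actually executed, a corrupted step changes the states visited afterwards, which is why the noise propagates through the rest of the collected trajectory.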
5.7 Effects of Demonstration Size
In this section, we show comparisons among our method and other methods with different amounts of demonstration data. We use a trained agent to collect three sets of demonstrations which include 10k, 150k, and 300k transitions each. In the experiments, we find that our algorithm performs well when the amount of data is large and is comparable to supervised methods even with a limited amount of data. In Fig. 5 (right), we show when there are extremely limited amounts of demonstration data (10k transitions or 30 minutes of experience), our method performs on par with supervised methods. In the appendix, we show the results for 150k and 300k transitions: our method outperforms the baselines by a large margin with 300k transitions. In summary, our method can learn from small amounts of demonstration data and dominates in terms of performance when there is a sufficient amount of data.
5.8 Effects of Reward Choice
In the above experiments, we adopt a natural reward: it maximizes speed along the lane, minimizes speed perpendicular to the lane, and penalizes the agent when it hits anything. However, very informative rewards are not available under many conditions. In this section, we study whether our method is robust to a less informative reward. We change the reward function to be the square of the agent’s speed, irrespective of the speed’s direction. This reward encourages the agent to drive fast; however, it is difficult to work with because the agent has to learn by itself that driving off-road or hitting obstacles reduces its future speed. It is also hard because $\mathit{speed}^{2}$ has a large numerical range. Figure 6 shows that NAC still performs best at convergence, while DQfD suffers from severe performance degradation.
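To make the contrast between the two rewards concrete, here is a minimal sketch. The text specifies only the qualitative shape of the natural reward, so the unit weights, the `collision_penalty` value, the 2-D velocity representation, and both function names are assumptions made for illustration.

```python
import numpy as np

def natural_reward(velocity, lane_dir, collided, collision_penalty=1.0):
    """Speed along the lane, minus speed across it, minus a collision penalty.

    The equal weighting of the terms and the penalty size are assumed.
    """
    velocity = np.asarray(velocity, dtype=float)
    lane_dir = np.asarray(lane_dir, dtype=float)
    lane_dir = lane_dir / np.linalg.norm(lane_dir)
    along = float(velocity @ lane_dir)
    perpendicular = float(np.linalg.norm(velocity - along * lane_dir))
    return along - perpendicular - (collision_penalty if collided else 0.0)

def speed_squared_reward(velocity):
    """Less informative reward: squared speed, regardless of direction."""
    velocity = np.asarray(velocity, dtype=float)
    return float(velocity @ velocity)

# Driving diagonally across the lane: penalized by the natural reward,
# but rewarded just as much as driving straight by the speed^2 reward.
r_natural = natural_reward([3.0, 4.0], lane_dir=[1.0, 0.0], collided=False)
r_speed = speed_squared_reward([3.0, 4.0])
```

The second reward carries no signal about lane direction or collisions, and its magnitude grows quadratically with speed, which matches the two difficulties named above.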
6 Conclusion
We proposed a Normalized Actor-Critic algorithm for reinforcement learning from demonstrations. Our algorithm provides a unified approach for learning from reward and demonstrations, and is robust to potentially suboptimal demonstration data. An agent can be fine-tuned with rewards after training on demonstrations by simply continuing to perform the same algorithm on on-policy data. Our algorithm preserves and improves the behaviors learned from demonstrations while receiving reward through interaction with an environment.
References
 Abbeel & Ng (2004) Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. ACM, 2004.
 Bojarski et al. (2016) Bojarski, Mariusz, Del Testa, Davide, Dworakowski, Daniel, Firner, Bernhard, Flepp, Beat, Goyal, Prasoon, Jackel, Lawrence D, Monfort, Mathew, Muller, Urs, Zhang, Jiakai, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
 Brockman et al. (2016) Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Chemali & Lazaric (2015) Chemali, Jessica and Lazaric, Alessandro. Direct policy iteration with demonstrations. In IJCAI, pp. 3380–3386, 2015.
 Degris et al. (2012) Degris, Thomas, White, Martha, and Sutton, Richard S. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
 Ebrahimi et al. (2017) Ebrahimi, Sayna, Rohrbach, Anna, and Darrell, Trevor. Gradient-free policy architecture search and adaptation. In Levine, Sergey, Vanhoucke, Vincent, and Goldberg, Ken (eds.), Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pp. 505–514. PMLR, 13–15 Nov 2017. URL http://proceedings.mlr.press/v78/ebrahimi17a.html.
 Gu et al. (2016) Gu, Shixiang, Lillicrap, Timothy, Ghahramani, Zoubin, Turner, Richard E, and Levine, Sergey. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
 Gu et al. (2017) Gu, Shixiang, Lillicrap, Timothy, Ghahramani, Zoubin, Turner, Richard E, Schölkopf, Bernhard, and Levine, Sergey. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. arXiv preprint arXiv:1706.00387, 2017.
 Haarnoja et al. (2017a) Haarnoja, Tuomas, Tang, Haoran, Abbeel, Pieter, and Levine, Sergey. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017a.
 Haarnoja et al. (2017b) Haarnoja, Tuomas, Tang, Haoran, Abbeel, Pieter, and Levine, Sergey. Reinforcement learning with deep energy-based policies. In Precup, Doina and Teh, Yee Whye (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1352–1361, International Convention Centre, Sydney, Australia, 06–11 Aug 2017b. PMLR. URL http://proceedings.mlr.press/v70/haarnoja17a.html.
 Hester et al. (2017) Hester, Todd, Vecerik, Matej, Pietquin, Olivier, Lanctot, Marc, Schaul, Tom, Piot, Bilal, Sendonaris, Andrew, Dulac-Arnold, Gabriel, Osband, Ian, Agapiou, John, et al. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.
 Ho & Ermon (2016) Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
 Jaderberg et al. (2016) Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Kakade et al. (2003) Kakade, Sham Machandranath et al. On the sample complexity of reinforcement learning. PhD thesis, University of London London, England, 2003.
 Kim et al. (2013) Kim, Beomjoon, massoud Farahmand, Amir, Pineau, Joelle, and Precup, Doina. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Munos et al. (2016) Munos, Rémi, Stepleton, Tom, Harutyunyan, Anna, and Bellemare, Marc. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
 Nachum et al. (2017) Nachum, Ofir, Norouzi, Mohammad, Xu, Kelvin, and Schuurmans, Dale. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.
 Ng et al. (2000) Ng, Andrew Y, Russell, Stuart J, et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.
 Piot et al. (2014) Piot, Bilal, Geist, Matthieu, and Pietquin, Olivier. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 549–564. Springer, 2014.
 Ross et al. (2011) Ross, Stéphane, Gordon, Geoffrey J, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.
 Schulman et al. (2017) Schulman, John, Abbeel, Pieter, and Chen, Xi. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
 Shelhamer et al. (2016) Shelhamer, Evan, Mahmoudieh, Parsa, Argus, Max, and Darrell, Trevor. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
 Sutton & Barto (1998) Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Thie (1983) Thie, Paul R. Markov decision processes. Comap, Incorporated, 1983.
 Todorov (2008) Todorov, Emanuel. General duality between optimal control and estimation. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pp. 4286–4292. IEEE, 2008.
 Toussaint (2009) Toussaint, Marc. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049–1056. ACM, 2009.
 Wang et al. (2016) Wang, Ziyu, Bapst, Victor, Heess, Nicolas, Mnih, Volodymyr, Munos, Remi, Kavukcuoglu, Koray, and de Freitas, Nando. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 Wang et al. (2017) Wang, Ziyu, Merel, Josh, Reed, Scott, Wayne, Greg, de Freitas, Nando, and Heess, Nicolas. Robust imitation of diverse behaviors. arXiv preprint arXiv:1707.02747, 2017.
 Watkins & Dayan (1992) Watkins, Christopher JCH and Dayan, Peter. Q-learning. Machine learning, 8(3-4):279–292, 1992.
 Xu et al. (2016) Xu, Huazhe, Gao, Yang, Yu, Fisher, and Darrell, Trevor. End-to-end learning of driving models from large-scale video datasets. arXiv preprint arXiv:1612.01079, 2016.
 Ziebart (2010) Ziebart, Brian D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.
 Ziebart et al. (2008) Ziebart, Brian D, Maas, Andrew L, Bagnell, J Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.
7 Appendix
7.1 Normalized Actor-Critic with Q Parametrization
Usually the actor-critic method parametrizes $\pi (s,a)$ and $V(s)$ with a neural network that has two heads. In this section, we explore an alternative parametrization: Q-parametrization. Instead of outputting $\pi $ and $V$ directly, the neural network computes $Q(s,a)$. We parametrize $\pi $ and $V$ based on $Q$ by specifying a fixed mathematical transform:
$V_{Q}(s) = \alpha \log \sum_{a} \exp\left(Q(s,a)/\alpha\right)$  (15)
$\pi_{Q}(a \mid s) = \exp\left((Q(s,a) - V_{Q}(s))/\alpha\right)$  (16)
Note that the Q-parametrization we propose here can be seen as a specific design of the network architecture. Instead of allowing the net to output arbitrary $\pi (s,a)$ and $V(s)$ values, we restrict the network to only output $\pi (s,a)$ and $V(s)$ pairs that satisfy the above relationship. This extra restriction will not harm the network’s ability to learn, since the above relationship has to be satisfied at the optimal solution (Schulman et al., 2017; Haarnoja et al., 2017b; Nachum et al., 2017).
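As a rough illustration of Eqs. (15) and (16), the following NumPy sketch computes $V_Q$ and $\pi_Q$ from an array of Q-values. The function names and the max-subtraction step used for numerical stability are choices made here for illustration and are not taken from the paper's implementation.

```python
import numpy as np

def v_from_q(q, alpha):
    """Eq. (15): V_Q(s) = alpha * log sum_a exp(Q(s, a) / alpha).

    `q` has actions along the last axis. The maximum is subtracted before
    exponentiating so that small alpha does not overflow.
    """
    z = np.asarray(q, dtype=float) / alpha
    m = np.max(z, axis=-1, keepdims=True)
    lse = m + np.log(np.sum(np.exp(z - m), axis=-1, keepdims=True))
    return alpha * np.squeeze(lse, axis=-1)

def pi_from_q(q, alpha):
    """Eq. (16): pi_Q(a|s) = exp((Q(s, a) - V_Q(s)) / alpha)."""
    q = np.asarray(q, dtype=float)
    v = np.expand_dims(v_from_q(q, alpha), axis=-1)
    return np.exp((q - v) / alpha)

# Hypothetical Q-values for one state with three actions.
q_example = np.array([1.0, 2.0, 3.0])
v_example = v_from_q(q_example, alpha=0.1)
pi_example = pi_from_q(q_example, alpha=0.1)
```

By construction $\pi_Q(\cdot \mid s)$ sums to one, so it is a softmax over $Q(s,\cdot)/\alpha$, and $V_Q(s)$ approaches $\max_a Q(s,a)$ as $\alpha$ shrinks.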
Based on the Q-parametrization, we can derive the update of the actor. Note that we assume the behavioral policy is ${\pi}_{Q}$, and we sample one step out of a trajectory, thus dropping the subscript $t$. The goal is to maximize the expected future reward; taking its gradient, we get:
$\begin{aligned} & \nabla_{\theta} \mathbb{E}_{s,a\sim \pi_{Q}}\left[R(s,a)\right] \\ =\; & \mathbb{E}_{s,a\sim \pi_{Q}}\left[R(s,a)\, \nabla_{\theta} \log p(a,s \mid \pi_{Q})\right] \\ \approx\; & \mathbb{E}_{s,a\sim \pi_{Q}}\left[R(s,a)\, \nabla_{\theta} \log \pi_{Q}(a \mid s)\right] \end{aligned}$  (17)
where the last step ignores the state distribution and is therefore an approximation. After adding a baseline function $b(s)$, the gradient takes the following form, where $\widehat{Q}(s,a)=R(s,a)+\gamma {V}_{Q}({s}^{\prime})$:
$\begin{aligned} & \mathbb{E}_{s,a}\left[\nabla_{\theta} \log \pi_{Q}(a \mid s)\, (\widehat{Q}(s,a) - b(s))\right] \\ =\; & \mathbb{E}_{s}\Big[\sum_{a} \pi_{Q}(a \mid s)\, \nabla_{\theta} \log \pi_{Q}(a \mid s)\, (\widehat{Q}(s,a) - b(s))\Big] \end{aligned}$  (18)
As in previous work, an entropy-regularized policy gradient simply adds the gradient of the entropy of the current policy, weighted by a tunable parameter $\alpha $, in order to encourage exploration. The entropy term is:
$$\begin{aligned} & \mathbb{E}_{s}\left[\alpha \nabla_{\theta} H(\pi_{Q}(\cdot \mid s))\right] \\ =\; & \mathbb{E}_{s}\Big[\alpha \nabla_{\theta} \sum_{a} -\pi_{Q}(a \mid s) \log \pi_{Q}(a \mid s)\Big] \\ =\; & \mathbb{E}_{s}\Big[\alpha \sum_{a} \big(-\nabla_{\theta} \pi_{Q}(a \mid s) \log \pi_{Q}(a \mid s) - \pi_{Q}(a \mid s) \nabla_{\theta} \log \pi_{Q}(a \mid s)\big)\Big] \\ =\; & \mathbb{E}_{s}\Big[\alpha \sum_{a} \big(-\nabla_{\theta} \pi_{Q}(a \mid s) \log \pi_{Q}(a \mid s) - \pi_{Q}(a \mid s) \frac{1}{\pi_{Q}(a \mid s)} \nabla_{\theta} \pi_{Q}(a \mid s)\big)\Big] \\ =\; & \mathbb{E}_{s}\Big[\alpha \sum_{a} -\nabla_{\theta} \pi_{Q}(a \mid s) \log \pi_{Q}(a \mid s)\Big] \\ =\; & \mathbb{E}_{s}\Big[\alpha \sum_{a} -\pi_{Q}(a \mid s)\, \nabla_{\theta} \log \pi_{Q}(a \mid s)\, \log \pi_{Q}(a \mid s)\Big] \end{aligned}$$  (19)
Putting the two terms together and using the energy-based policy formulation (Eq. (16)):
$\begin{aligned} & \mathbb{E}_{s,a}\left[\nabla_{\theta} \log \pi_{Q}(a \mid s)\, (\widehat{Q}(s,a) - b(s)) + \alpha \nabla_{\theta} H(\pi_{Q}(\cdot \mid s))\right] \\ =\; & \mathbb{E}_{s}\Big[\sum_{a} \pi_{Q}(a \mid s)\, \nabla_{\theta} \log \pi_{Q}(a \mid s)\, \big(\widehat{Q}(s,a) - b(s) - (Q(s,a) - V_{Q}(s))\big)\Big] \end{aligned}$
If we let the baseline $b(s)$ be ${V}_{Q}(s)$, we get the update:
$\begin{aligned} & \mathbb{E}_{s}\Big[\sum_{a} \pi_{Q}(a \mid s)\, \nabla_{\theta} \log \pi_{Q}(a \mid s)\, (\widehat{Q}(s,a) - Q(s,a))\Big] \\ =\; & \frac{1}{\alpha}\, \mathbb{E}_{s,a}\left[(\nabla_{\theta} Q(s,a) - \nabla_{\theta} V_{Q}(s))\, (\widehat{Q}(s,a) - Q(s,a))\right] \end{aligned}$  (20)
where $\widehat{Q}(s,a)$ could be obtained through bootstrapping by $R(s,a)+\gamma {V}_{Q}({s}^{\prime})$. In practice ${V}_{Q}({s}^{\prime})$ is computed from a target network. For the critic, the update is:
$\begin{aligned} & \mathbb{E}_{s}\left[\nabla_{\theta} \frac{1}{2} (V_{Q}(s) - \widehat{V}(s))^{2}\right] \\ =\; & \mathbb{E}_{s}\left[\nabla_{\theta} V_{Q}(s)\, (V_{Q}(s) - \widehat{V}(s))\right] \end{aligned}$  (21)
where $\widehat{V}(s)$ could be similarly obtained by bootstrapping: $\widehat{V}(s)={\mathbb{E}}_{a}[R(s,a)+\gamma {V}_{Q}({s}^{\prime})]+\alpha H({\pi}_{Q}(\cdot \mid s))$.
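The actor and critic updates of Eqs. (20) and (21) can be sketched for a single sampled transition as gradients with respect to the network's output vector $Q(s,\cdot)$, using $\nabla_{Q(s,\cdot)} V_Q(s) = \pi_Q(\cdot \mid s)$. This is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, `q_next_target` stands in for the target network's output at $s'$, the expectation over actions in $\widehat{V}(s)$ is replaced by one sample, and backpropagation through the network itself is omitted.

```python
import numpy as np

def soft_value(q, alpha):
    """Eq. (15): V_Q(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    z = np.asarray(q, dtype=float) / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

def soft_policy(q, alpha):
    """Eq. (16): pi_Q(a|s) = exp((Q(s, a) - V_Q(s)) / alpha)."""
    q = np.asarray(q, dtype=float)
    return np.exp((q - soft_value(q, alpha)) / alpha)

def actor_grad(q_s, action, reward, q_next_target, alpha=0.1, gamma=0.99):
    """Single-sample Eq. (20) as a gradient w.r.t. the vector Q(s, .)."""
    q_s = np.asarray(q_s, dtype=float)
    pi = soft_policy(q_s, alpha)
    q_hat = reward + gamma * soft_value(q_next_target, alpha)
    # dQ(s,a)/dQ(s,.) is one-hot at `action`; dV_Q(s)/dQ(s,.) is pi_Q(.|s).
    one_hot = np.zeros_like(q_s)
    one_hot[action] = 1.0
    return (one_hot - pi) * (q_hat - q_s[action]) / alpha

def critic_grad(q_s, reward, q_next_target, alpha=0.1, gamma=0.99):
    """Single-sample Eq. (21) as a gradient w.r.t. the vector Q(s, .)."""
    q_s = np.asarray(q_s, dtype=float)
    pi = soft_policy(q_s, alpha)
    v = soft_value(q_s, alpha)
    # Entropy via log pi = (Q - V_Q) / alpha, which avoids log(0).
    entropy = -np.sum(pi * (q_s - v) / alpha)
    v_hat = reward + gamma * soft_value(q_next_target, alpha) + alpha * entropy
    return pi * (v - v_hat)

# Hypothetical transition: three actions, action 1 taken, reward 1.0.
q_s = np.array([1.0, 2.0, 0.5])
q_next = np.array([0.0, 0.3, -0.2])
g_actor = actor_grad(q_s, action=1, reward=1.0, q_next_target=q_next)
g_critic = critic_grad(q_s, reward=1.0, q_next_target=q_next)
```

Note the differing conventions: `actor_grad` is an ascent direction for the regularized return, while `critic_grad` is the gradient of a squared-error loss and would be followed in the descent direction.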
7.2 Effects of Imperfect Demonstrations
See Figure 8 for more results for imperfect demonstrations when the amount of noise varies.
7.3 Effects of Demonstration Amount
See Figure 7 for more results on the effect of the demonstration amount.
7.4 Experiment Details
Network Architecture: We use the same architecture as in (Mnih et al., 2015) to parametrize $Q(s,a)$. With this Q parametrization, we also output ${\pi}_{Q}(a \mid s)$ and ${V}_{Q}(s)$ based on Eq. (16) and Eq. (15).
Hyperparameters: We use a replay buffer with a capacity of 1 million steps and update the target network every 10K steps. Initially, the learning rate is linearly annealed from 1e-4 to 5e-5 over the first $1/10$ of the training process and is then kept constant at 5e-5. Gradients are clipped at 10 to reduce training variance. The reward discount factor $\gamma $ is set to 0.99. We concatenate the 4 most recent frames as the input to the neural network. For the methods with an entropy regularizer, we set $\alpha $ to 0.1, following (Schulman et al., 2017). We truncate the importance sampling weighting factor $\beta =\min\{\frac{{\pi}_{Q}(a \mid s)}{\mu (a \mid s)},c\}$ at 10, i.e., $c=10$.
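Two of these settings are easy to misread in prose, so here is a minimal sketch of the learning-rate schedule (annealed from 1e-4 to 5e-5 over the first tenth of training) and of the truncated importance weight $\beta$ with $c=10$. The function names and the step-based interface are assumptions for illustration and do not reflect the actual training code.

```python
def learning_rate(step, total_steps, lr_start=1e-4, lr_end=5e-5, anneal_frac=0.1):
    """Linear annealing over the first `anneal_frac` of training, then constant."""
    anneal_steps = anneal_frac * total_steps
    if step >= anneal_steps:
        return lr_end
    progress = step / anneal_steps
    return lr_start + progress * (lr_end - lr_start)

def truncated_is_weight(pi_prob, mu_prob, c=10.0):
    """beta = min(pi_Q(a|s) / mu(a|s), c), with mu the behavior policy."""
    return min(pi_prob / mu_prob, c)

# Hypothetical run of 1000 updates.
lr_at_start = learning_rate(0, 1000)
lr_midway = learning_rate(50, 1000)
lr_after = learning_rate(500, 1000)
beta_clipped = truncated_is_weight(0.9, 0.01)
beta_unclipped = truncated_is_weight(0.2, 0.4)
```

Truncating the ratio at $c$ bounds the weight given to actions that the behavior policy $\mu$ rarely took, which limits the variance of the off-policy update.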