Abstract
In this paper, we study Reinforcement Learning from Demonstrations (RLfD)that improves the exploration efficiency of Reinforcement Learning (RL) byproviding expert demonstrations. Most of existing RLfD methods requiredemonstrations to be perfect and sufficient, which yet is unrealistic to meetin practice. To work on imperfect demonstrations, we first define an imperfectexpert setting for RLfD in a formal way, and then point out that previousmethods suffer from two issues in terms of optimality and convergence,respectively. Upon the theoretical findings we have derived, we tackle thesetwo issues by regarding the expert guidance as a soft constraint on regulatingthe policy exploration of the agent, which eventually leads to a constrainedoptimization problem. We further demonstrate that such problem is able to beaddressed efficiently by performing a local linear search on its dual form.Considerable empirical evaluations on a comprehensive collection of benchmarksindicate our method attains consistent improvement over other RLfDcounterparts.
Quick Read (beta)
Reinforcement Learning from Imperfect Demonstrations
under Soft Expert Guidance
Abstract
In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations. Most of existing RLfD methods require demonstrations to be perfect and sufficient, which yet is unrealistic to meet in practice. To work on imperfect demonstrations, we first define an imperfect expert setting for RLfD in a formal way, and then point out that previous methods suffer from two issues in terms of optimality and convergence, respectively. Upon the theoretical findings we have derived, we tackle these two issues by regarding the expert guidance as a soft constraint on regulating the policy exploration of the agent, which eventually leads to a constrained optimization problem. We further demonstrate that such problem is able to be addressed efficiently by performing a local linear search on its dual form. Considerable empirical evaluations on a comprehensive collection of benchmarks indicate our method attains consistent improvement over other RLfD counterparts.
Reinforcement Learning from Imperfect Demonstrations
under Soft Expert Guidance
Mingxuan Jing^{†}^{†}thanks: These two authors contributed equally. Correspondence to Fuchun Sun.${}^{\mathrm{1}}$, Xiaojian Ma^{1}^{1}footnotemark: 1 ${}^{\mathrm{1}}$${}^{\mathrm{2}}$, Wenbing Huang${}^{\mathrm{1}}$, Fuchun Sun${}^{\mathrm{1}}$, Chao Yang${}^{\mathrm{1}}$, Bin Fang${}^{\mathrm{1}}$, Huaping Liu${}^{\mathrm{1}}$ ${}^{1}$Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China ${}^{2}$Center for Vision, Cognition, Learning and Autonomy, Department of Computer Science, UCLA, CA 90095, USA [email protected], [email protected] [email protected], [email protected]
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 Introduction
Reinforcement Learning (RL) (?) enables robots to acquire skills by interacting with the environment. Despite the conspicuous advancements they have attained, typical RL methods suffer from the exploration issue that performing exploration over novel actionstate trajectories is inefficient, and is not spontaneously guaranteed when the reward signals are sparse or incomplete. Thus, a fairly of approaches (?; ?; ?; ?; ?) have resorted to the combination of RL with expert demonstrations (containing actionstate trajectories), giving rise to a new research vein that exploits expert demonstrations to help policy exploration of the agent. We refer this vein as Reinforcement Learning from Demonstrations (RLfD) in this paper.
Early RLfD methods enhance RL by either putting expert demonstrations into a replay buffer for value estimation (?; ?) or applying them to pretrain the policy in a supervised manner (?; ?), both of which, however, simply regard demonstrations as dataaugmentations without making full use of them during the policy optimization procedure. To address this weakness, modern RLfD approaches (?; ?) absorb ingredients from Imitation Learning (?; ?; ?; ?; ?; ?), and encourage the agent to mimic the demonstrated behaviors when the environmental feedback is rare or even unavailable. Specifically, they reshape the native reward in RL by adding a demonstrationguided term to force expertalike exploration.
Whereas encouraging expertalike actions does help in avoiding futile exploration, continuously enforcing such type of rewards during the whole learning phase is problematic if the provided demonstrations are imperfect. Here, the notion of imperfect demonstrations implies two senses: I. The quality of demonstrations is imperfect, which could be caused by data collection noise or intrinsically produced by the immaturity of the expert. II. The number of demonstrations is insufficient, which is due to the consuming resource and effort in collection. The imperfectness of demonstrations will potentially, if not always, make the convergence of the agent policy to be suboptimal. As illustrated in Figure 1 and nonstrictly speaking, the learned agent policy by existing RLfD methods will converge to a point nearby the underlying expert policy. If the demonstrations/expert policies are imperfect, we have no guarantee to obtain better agent policy (or even have a potential detriment to the policy searching) by always minimizing its divergence to the expert behavior.
In this paper, we propose to conduct RL from imperfect demonstrations by applying expert guidance in a soft way. To illustrate our idea, let us revisit the example in Figure 1. We assume that the optimal agent policy still locates within a certain region around the imperfect expert policy (denoted by the red area in the Right of Figure 1), and once the agent policy is within this region, its optimization is only affected by the interaction with the environment and is no longer influenced by demonstrations. The intuition behind is that the expert demonstrations—even when they are imperfect—are able to characterize what actions are good in general but not precisely. Conventional RLfD methods fix the demonstration reward during the whole learning procedure and are not flexible enough to meet our requirement.
Towards our purpose, we reformulate the RLfD task as a constrained policy optimization problem (?; ?; ?), where the goal is formulated by the native RL objective and the constraint is to bound the exploration region around the demonstrations under a certain threshold. By this formulation, the expert demonstrations regulate the agent policy updating only when the policy is outside the constraint region, which is consistent with our assumption mentioned above. Nevertheless, solving the constrained optimization problem is nontrivial. To tackle it effectively, we propose to search the optimal policy update for each step with a linearized subobjective. Through leveraging its dual form, we can significantly reduce the problem size and empowers the scalability to policy models with highdimensional parameter space like neural networks. We provide more details in Sec. 3.
We summarize our contributions as follows.

•
To the best of our knowledge, we are the first to formulate RLfD as a constrained optimization problem, by which we are able to make full use of imperfect demonstrations in a soft and also more effective manner.

•
We develop an efficient method to solve the proposed constrained optimization problem with scalable policy model like deep neural networks.

•
With imperfect demonstrations, our method achieves consistent improvement over other RLfD counterparts on several challenging physical control benchmarks.
The rest of paper is organized as follows: In Sec. 2, we first provide necessary notations and preliminaries about the subject of RLfD. Then our proposed method will be detailed in Sec. 3 with analysis and efficient implementation. The discussion on some related research will be included in Sec. 4. Finally, experimental evaluations will be demonstrated in Sec. 5.
2 Preliminaries
Notations. For modeling the action decision process in our context, a standard Markov decision process (MDP) (?) $(\mathcal{S},\mathcal{A},r,\mathcal{T},\mu ,\gamma )$ is considered, where $\mathcal{S}$ and $\mathcal{A}$ denotes the space of feasible states and actions respectively, $r(s,a)\to \mathbb{R}$ is the reward function, $\mathcal{T}({s}^{\prime}s,a)$ and $\mu (s)$ represent the transition probability and initial state distribution and $\gamma \in (0,1)$ is the discount factor. A stochastic policy $\pi (as):\mathcal{S}\times \mathcal{A}\to [0,1]$ maps state into action distribution. A trajectory $\zeta $ is given by the sequence of stateaction pairs $\{({s}_{0},{a}_{0}),({s}_{1},{a}_{1}),\mathrm{\dots}\}$.
Occupancy measure. The concept of occupancy measure (?; ?) defined below characterizes the distribution of the stateaction pairs within the exploration trajectories when policy $\pi $ is executed, which will be useful in the following analysis.
Definition 1 (Occupancy Measure).
Given a stationary policy $\pi $, let ${\rho}_{\pi}\mathit{}\mathrm{(}s\mathrm{)}\mathrm{:}\mathrm{S}\mathrm{\to}\mathrm{R}$ and ${\rho}_{\pi}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}\mathrm{:}\mathrm{S}\mathrm{\times}\mathrm{A}\mathrm{\to}\mathrm{R}$ denote the density of the state distribution and the joint distribution for state and action under the policy $\pi $,
$\begin{array}{cc}& {\rho}_{\pi}(s)\triangleq {\displaystyle \sum _{t=0}^{\mathrm{\infty}}}{\gamma}^{t}P({s}_{t}=s\pi )\hfill \\ & {\rho}_{\pi}(s,a)\triangleq {\rho}_{\pi}(s)\pi (as)\mathit{\text{.}}\hfill \end{array}$  (1) 
We name ${\rho}_{\pi}\mathit{}\mathrm{(}s\mathrm{,}a\mathrm{)}$ as occupancy measure of policy $\pi $.
Formulation of RLfD. The objective of RL is to maximize the cumulative expected (discounted) return along the whole decision procedure $\eta (\pi )={\mathbb{E}}_{\pi}[{\sum}_{t}^{\mathrm{\infty}}{\gamma}^{t}r({s}_{t},{a}_{t})]$, given current action policy $\pi $ (?). While RLfD enhances RL with providing a set of demonstrated trajectories $\mathcal{D}=\{{\zeta}_{0},\mathrm{\cdots}\}$ draw from a referred expert with policy ${\pi}_{E}$ as an extra guidance other than reward. Such expert data can be useful notably when the environmental feedback is sparse or delayed (?), in which the agent may suffer from ineffective explorations since positive feedback could rarely occur.
RLfD with penalty departures. Some previous research(?; ?) suggest exploring towards the area that frequently visited by expert policy ${\pi}_{E}$, because it may provide a higher and denser return that agent can benefit from. As it mentioned above, such visiting frequency is essentially characterized by expert’s occupancy measure. Intuitively, we can leverage the distribution discrepancy between the occupancy measure of expert and agent as an extra feedback signal to encourage this exploration behavior, which gives us the following objective
$\underset{\pi}{\mathrm{min}}{\mathcal{L}}_{\pi}=\eta (\pi )+\lambda \cdot \mathbb{D}\left[{\rho}_{\pi}(s,a)\parallel {\rho}_{E}(s,a)\right]\text{,}$  (2) 
where $\mathbb{D}(\cdot \parallel \cdot )$ and ${\rho}_{E}(s,a)$ depict any discrepancy measure and expert’s occupancy measure respectively, $\lambda $ is an adjustable weight. We refer (2) as RLfD with penalty departures in the following context since the discrepancy is introduced as a penalty function upon the original objective of RL and can be approximated through expert demonstrations.
3 Methodology
In this section, we will first introduce the new setting of imperfect expert for RLfD and emphasize the optimality and convergence issues in the penalty method, which essentially motivates our approach to employ expert guidance as a soft constraint instead. We also demonstrate that such constrained optimization problem can be solved efficiently by performing a local linear search on its dual form, maintaining its scalability to complex policy model like deep neural networks. Finally, we provide a practical implementation of our method.
3.1 Expert Guidance as a Soft Constraint: Towards RLfD with an Imperfect Expert
We now illustrate the imperfect setting for RLfD. As it mentioned in Sec. 1, the imperfectness here is raised from two facets: quality and amount of available demonstrations. Here we first focus on the quality, and the issue on the amount of demonstrations will be discussed later. Compared to the perfect expert setting that assumed the expert policy has already maximized the expected return (?; ?), an imperfect expert employs a policy that still not converge to an expected local optimum w.r.t. the considered RL objective. Without loss of generality, an imperfect expert, can be defined as follows.
Definition 2 (Imperfect Expert Policy).
Denoting ${\pi}_{{\theta}^{\mathrm{+}}}$ and ${\pi}_{{\theta}^{\mathrm{}}}$ as perfect and imperfect expert policies respectively. ${\pi}_{{\theta}^{\mathrm{}}}$ either attains local optimum with a lower return than ${\pi}_{{\theta}^{\mathrm{+}}}$ or does not belong to any local optima.
${\pi}_{{\theta}^{+}}\in \{\pi :\underset{\pi}{\mathrm{arg}\mathrm{max}}\eta (\pi )\mathtt{\text{AND}}{\displaystyle \frac{\partial \eta ({\pi}_{{\theta}^{+}})}{\partial {\theta}^{+}}}=0\}$  
$$ 
where $\eta \mathit{}\mathrm{(}\pi \mathrm{)}$ is the objective of currently considered RL task.
The penalty method presented in Sec. 2 works comparably well when expert is optimal (?; ?). However, optimizing the composite sum of two parts in (2) under imperfect expert setting is problematic, as it may alter the optimality and induces no convergence guarantee for the original RL objective. The following propositions illustrate this issue formally.
Proposition 1.
Denoting ${\pi}_{{\theta}^{\mathrm{\star}}}\mathrm{=}{\mathrm{arg}\mathit{}\mathrm{max}}_{\pi}\mathit{}\eta \mathit{}\mathrm{(}\pi \mathrm{)}$ as the optimal policy under the given RL objective $\eta \mathit{}\mathrm{(}\pi \mathrm{)}$. Then for the additional distribution discrepancy term $\mathrm{D}\mathit{}\mathrm{\left[}{\rho}_{\pi}\mathrm{\parallel}{\rho}_{E}\mathrm{\right]}$ in (2), ${\frac{\mathrm{\partial}\mathit{}\mathrm{D}\mathit{}\mathrm{[}{\rho}_{{\pi}_{\theta}}\mathrm{\parallel}{\rho}_{{\pi}_{{\theta}^{\mathrm{+}}}}\mathrm{]}}{\mathrm{\partial}\mathit{}\theta}\mathrm{}}_{\theta \mathrm{=}{\theta}^{\mathrm{\star}}}\mathrm{=}\mathrm{0}$. But when an imperfect expert ${\pi}_{{\theta}^{\mathrm{}}}$ is adopted, this result does not hold under certain conditions.
This proposition presents an intuitive result that the optimal policy for a given RL task can’t always be an optimum of the additional discrepancy term in (2) under imperfect demonstrations. We will further show that this may make (2) converge to a solution that is nonoptimal for the original RL problem.
Proposition 2.
When the penalty method (2) under imperfect demonstrations converges to a local optimum ${\pi}_{\theta}$, it can’t always be the optimal solution for the original RL objective.
$\frac{\partial {\mathcal{L}}_{{\pi}_{\theta}}}{\partial \theta}}=0\nRightarrow {\pi}_{\theta}=\underset{\pi}{\mathrm{arg}\mathrm{max}}\eta (\pi )\mathit{\text{.}$ 
While under the same certain condition as Proposition 1, we can obtain an even stronger conclusion
$$ 
The two propositions above imply that, under the imperfect setting, the additional penalty term will substantially change the optimization landscape of original RL problem and may induce convergence to a nonoptimal solution. Although it can offer positive guidance in the early training phase, it will soon become misleading and prevent the policy from attaining higher return. To tackle this issue, we propose to transform the distribution discrepancy penalty term into a constraint instead. This intuition is actually based on the following observation:
Proposition 3.
There exists a bounded tolerance factor $d$ such that the optimal policy ${\pi}_{{\theta}^{\mathrm{\star}}}$ always stay within an area closer to the demonstrations specified by $d$, even when the demonstrations are drawn from an imperfect expert.
$\exists d\in [0,\mathrm{\infty}),\mathbb{D}\left[{\rho}_{{\pi}_{{\theta}^{\star}}}\parallel {\rho}_{{\pi}_{{\theta}^{}}}\right]\u2a7dd,{\pi}_{{\theta}^{\star}}=\underset{\pi}{\mathrm{arg}\mathrm{max}}\eta (\pi )\mathit{\text{.}}$ 
From the perspective of optimization, it suggests that using constraint could better fit the imperfect expert setting by two reasons. 1. Optimality. Refer to Proposition 3, given a proper tolerance $d$, once the optimal policy satisfies the constraint, the new constrained optimization problem will share the same optimal solution with the original RL problem. 2. Convergence. The constraint only affects policy update when it is not satisfied. Therefore, when the policy improves to a certain extent, i.e. stays within the constraint, it will only learn from the original reward feedback and finally converge to the optimality of the original RL problem. As a conclusion, compared to the penalty method, the constraint method can leverage the imperfect demonstrations for guiding the policy while eliminating their side effects in optimization, thus can work better with imperfect experts.
For another facet of imperfectness, i.e. amount, as the expert data is mainly introduced for computing the distribution discrepancy in our context, the issue of insufficient amount of demonstrations will essentially rely on the estimation error to the discrepancy, which may induce bias to policy update especially when the gradient step is relatively large. We refer to the idea of local policy search (?; ?) to alleviate this issue by making conservative gradient step instead with an auxiliary constraint on the change of KullbackLeibler (KL) divergence of the updated policy.
Overall optimization problem. By combining the discrepancy constraint and local policy search, the eventual optimization problem ($k$th step) with imperfect expert ${\pi}_{{\theta}^{}}$ is
$\begin{array}{cc}\hfill {\theta}_{k+1}=\underset{\theta}{\mathrm{arg}\mathrm{max}}& \eta ({\pi}_{{\theta}_{k}})\hfill \\ \hfill \text{s.t.}& \mathbb{D}\left[{\rho}_{{\pi}_{{\theta}_{k}}}(s,a)\parallel {\rho}_{{\pi}_{{\theta}^{}}}(s,a)\right]\u2a7d{d}_{k}\hfill \\ & {\mathbb{D}}_{\mathrm{KL}}[{\pi}_{{\theta}_{k}}(as)\parallel {\pi}_{{\theta}_{k+1}}(as)]\u2a7d\delta \text{,}\hfill \end{array}$  (3) 
where $\delta $ is the tolerance of the KL constraint. The remaining issue now is how to determine the tolerance factor ${d}_{k}$ for the discrepancy constraint in each step. To avoid handcrafting this parameter on different tasks and demonstrations, we apply a simple annealing strategy on ${d}_{k}$ to realize a soft constraint as it can adapt along with the improvement of policy, comparing to a fixed tolerance. Specifically, we adopt the following update rule for ${d}_{k}$
${d}_{k+1}\leftarrow {d}_{k}+{d}_{k}\cdot \u03f5\text{,}$  (4) 
where $\u03f5$ is the annealing factor. We will further demonstrate the advantage on adopting a soft constraint and the strategy on hyperparameter choosing in our empirical evaluations in Sec. 5.4.
3.2 Solving with Scalable Policy Models
We’ve shown the issues of the penalty method for RLfD when the expert data is imperfect, and therefore motivated our new approach that reformulates it as a constrained policy optimization problem (3). Nevertheless, solving it accurately can be rather challenging due to: 1. Feasibility, it may be difficult to find a feasible solution with the two constraints. 2. Scalability, for policies that are characterized by a model with highdimensional parameter space, i.e. neural networks, the computation cost of the new constraint will become unaffordable. To this end, we propose to approximately solve it by linearizing around ${\pi}_{{\theta}_{k}}$ at each optimization step. Denoting the gradient of the objective as $g$, the current discrepancy at ${\theta}_{k}$ as ${d}_{{\theta}_{k}}$ and its gradient as $b$, the Hessian matrix of the KLdivergence as $H$^{1}^{1} 1 The KL constraint should be approximated via secondorder expansion since its first order gradient is zero at ${\pi}_{\theta}={\pi}_{{\theta}_{k}}$., the linear approximation to (3) is
$\begin{array}{cc}\hfill {\theta}_{k+1}=\underset{\theta}{\mathrm{arg}\mathrm{max}}& {g}^{T}(\theta {\theta}_{k})\hfill \\ \hfill \text{s.t.}& {b}^{T}(\theta {\theta}_{k})+{d}_{{\theta}_{k}}\u2a7d{d}_{k}\hfill \\ & {\displaystyle \frac{1}{2}}{(\theta {\theta}_{k})}^{T}H(\theta {\theta}_{k})\u2a7d\delta \text{.}\hfill \end{array}$  (5) 
The approximated optimization problem above is convex as $H$ is always positive semidefinite (?). Therefore, compared to its original form (3), a feasible solution can be found more easily using duality. In particular, given $\lambda $ and $\nu $ as the Lagrange multipliers for KLdivergence and discrepancy constraints, a corresponding dual to (5) can be written as
$\begin{array}{c}\hfill \underset{\genfrac{}{}{0pt}{}{\lambda \ge 0}{\nu \ge 0}}{\mathrm{max}}{\displaystyle \frac{1}{2\lambda}}({g}^{T}u+2\nu {b}^{T}u+{\nu}^{2}{b}^{T}r)\nu c\lambda \delta \text{,}\end{array}$  (6) 
where $u={H}^{1}g$, $r={H}^{1}b$, $c={d}_{k}{d}_{{\theta}_{k}}$. Since the number of variables in this dual problem is much less than the dimension of $\theta $, the computation cost will also be much less than solving (3). A closedform solution of optimal solution ${\lambda}^{\star}$, ${\nu}^{\star}$ can be derived by firstly obtaining and substituting ${\nu}^{\star}$, then discussing the subcase and finally gets ${\lambda}^{\star}$. Suppose we have the optimal solution ${\lambda}^{\star}$, ${\nu}^{\star}$ of this dual problem, the solution to the primal one will be
$\begin{array}{c}\hfill {\theta}_{k+1}^{\star}={\theta}_{k}{\displaystyle \frac{1}{{\lambda}^{\star}}}(u+r{\nu}^{\star})\text{.}\end{array}$  (7) 
When there is at least one feasible point within the KL constraint (the trust region), we can update the policy parameter $\theta $ by solving the dual for ${\lambda}^{\star}$ and ${\nu}^{\star}$ (7). However, due to the initialization and approximation error, the proposed update rule may sometimes not satisfy the constraints in (3), especially at the beginning of optimization. In the next section, we will provide more details on ensuring the feasibility.
[t!]
\[email protected]@algorithmic
\STATEInput: Imperfect expert demonstrations ${\mathcal{D}}_{E}=\{{\zeta}_{i}^{E}\}$, initial policy ${\pi}_{{\theta}_{0}}$, initial constraints tolerance ${d}_{0}$, $\delta $, annealing factor $\u03f5$, maximal iterations $N$.
\FORk = 0 to $N$
\STATESample rollout ${\mathcal{D}}_{\pi}$ with ${\pi}_{{\theta}_{k}}$.
\STATEEstimate $\widehat{g}$, $\widehat{b}$, $\widehat{H}$ with samples from ${\mathcal{D}}_{E}$ and ${\mathcal{D}}_{\pi}$.
\IFthe optimization problem (5) is feasible
\STATESolve the dual problem (6) to obtain ${\lambda}^{*}$, ${\nu}^{*}$.
\STATECompute update step proposal $\mathrm{\Delta}\theta $ as (7).
\STATEUpdate the policy by backtracking linesearch along $\mathrm{\Delta}\theta $ to ensure the satisfaction of constraints.
\ELSE\STATEUpdate the policy via the recovery objective (9).
\ENDIF
Annealing the tolerance ${d}_{k}$: ${d}_{k+1}\leftarrow {d}_{k}+{d}_{k}\cdot \u03f5$.
\ENDFOR
3.3 Implementation Details
The choice of discrepancy measure. In RLfD, as we can only access the samples (demonstrations) from the expert policy and its occupancy measure, we adopt the nonparametric distance metric MMD (?; ?; ?) as the discrepancy measure. The value and gradient w.r.t. policy parameters of MMD can be easily computed with demonstrations and agent rollouts. Moreover, we use the characteristic Gaussian kernel to ensure the following property
$\mathrm{MMD}[p,q]=0\iff p=q\text{,}$  (8) 
where $p$, $q$ denote two distributions. This property can help alleviate the inconsistency between minimizing discrepancy and morphing distributions within the discrepancy constraint and improve the optimization (?).
Feasibility issue. The major crux that accounts for the feasibility issue when solving (3) can be twofold. One lies in the beginning phase. As the parameter $\theta $ is usually randomly initialized, it may induce infeasibility when the optimization just starts. We propose a recovery strategy that transforms the constraint into an objective to eliminate this issue.
$\begin{array}{c}\hfill {\theta}^{\star}=\underset{\theta}{\mathrm{arg}\mathrm{min}}\mathbb{D}\left[{\rho}_{{\pi}_{\theta}}(s,a)\parallel {\rho}_{{\pi}_{{\theta}^{}}}(s,a)\right]\text{.}\end{array}$  (9) 
Another source of infeasibility comes from (7). The update rule may not satisfy the constraints due to the approximation error. To this end, we apply a backtracking linesearch along $\mathrm{\Delta}\theta =\lambda ^{\star}{}^{1}(u+r{\nu}^{\star})$ to ensure the constraint satisfaction. To further reduce the computation cost, we also adopt the conjugate gradient method like (?) to approximately compute the inverse of $H$ and its products.
The algorithm detail is summarized in Algorithm 3.2.
4 Discussion
In this section, we will discuss some relevant research on RLfD, and demonstrate how they connect to our method.
Pretrain with demonstrations. A straightforward solution for combining demonstrations in RL will be pretraining agent policy with expert data via imitation learning, e.g. behaviour cloning (?; ?), then proceeding with normal RL (?; ?). The first step is similar to our constrained optimization approach under unsatisfied constraints when the optimization starts, sometimes even have better performance at the beginning. However, this method cannot guarantee the exploration quality in the later RL step; thus the subsequent training can still suffer from poor sample efficiency in the case with large exploration space and sparse feedback.
Penalty with other discrepancy measures. There is also some research on investigating different discrepancy measures for RLfD with penalty departures (?; ?). Notable recent research is POfD (?), which proposed to leverage Generative Adversarial Networks (GANs) (?) to evaluate the discrepancy between the occupancy measure of expert and agent. In our comparative evaluations, it demonstrates comparative performances than baseline that employs MMD as penalty departures. However, this method requires an extra parameterized model (discriminator) and training procedure (adversarial training), which substantially increase the difficulty of convergence.
Penalty with annealing. In Sec. 3.1, we’ve mentioned that our constraint method adopts an annealing strategy to select the constraint factor adaptively. Since our method would expect the optimal policy to stay within the constraint, annealing is more practical than manually specifying a fixed factor for different task and demonstrations. Similarly, this strategy is also applicable to the factor $\lambda $ in penalty method (2) for suppressing the side effect of imperfect demonstrations. However, we should notice that annealing can only partly alleviate this impact before $\lambda $ becomes zero. While in our approach, only the original RL objective is being optimized once the constraint with imperfect expert data is satisfied. In fact, our empirical results in Sec. 5.2 indicate penalty with annealing does perform advantageously than pure penalty method in some evaluated tasks, but there is still a significant gap to our approach using soft constraint.
MountainCar  DoublePendulum  Hopper  Walker2d  HalfCheetah  Ant  
$\mathcal{S}$ / $\mathcal{A}$  ${\mathbb{R}}^{4}$ / $\{0,1\}$  ${\mathbb{R}}^{11}$ / ${\mathbb{R}}^{1}$  ${\mathbb{R}}^{11}$ / ${\mathbb{R}}^{3}$  ${\mathbb{R}}^{17}$ / ${\mathbb{R}}^{6}$  ${\mathbb{R}}^{17}$ / ${\mathbb{R}}^{6}$  ${\mathbb{R}}^{111}$ / ${\mathbb{R}}^{8}$ 
Setting / Demo  S1 / 81.25  S3 / 1488.28  S2 / 969.71  S2 / 1843.75  S2 / 2109.80  S2 / 1942.05 
PPO  0.74$\pm $9.61  302.77$\pm $37.09  17.09$\pm $13.54  1.54$\pm $5.75  978.84$\pm $665.61  2332.95$\pm $2193.85 
MMDIL  82.99$\pm $4.57  218.43$\pm $13.72  118.66$\pm $0.38  8.88$\pm $6.07  161.74$\pm $219.85  967.83$\pm $0.87 
Pretrain  83.35$\pm $6.32  8928.79$\pm $388.61  1356.47$\pm $470.43  2607.38$\pm $301.94  3831.96$\pm $150.30  5377.25$\pm $1682.56 
POfD  45.01$\pm $28.16  628.47$\pm $69.36  32.13$\pm $24.23  1.48$\pm $0.03  2801.59$\pm $66.03  68.59$\pm $19.17 
Penalty  120.29$\pm $48.30  1902.95$\pm $210.41  1225.03$\pm $296.52  286.23$\pm $12.46  1517.68$\pm $35.85  3711.12$\pm $794.97 
Penalty + Ann.  79.00$\pm $1.04  1671.78$\pm $108.80  1220.10$\pm $112.74  282.00$\pm $6.70  2592.94$\pm $870.04  116.89$\pm $88.01 
Ours  83.46$\mathrm{\pm}$1.42  9331.40$\mathrm{\pm}$5.95  2329.89$\mathrm{\pm}$125.85  3483.78$\mathrm{\pm}$269.59  4106.69$\mathrm{\pm}$95.47  2645.58$\mathrm{\pm}$118.55 
5 Experiments
For the experiments below, we aim at investigating the following questions:

1.
Under the same imperfect expert settings, can our method attains better performative results versus the counterparts that do not employ demonstrations as a soft constraint?

2.
How can the different settings of imperfect expert data, i.e. quality and amount, affect the performances of our method and baselines?

3.
What is the key ingredient in our method that introduces better empirical results?
To answer the first question, we evaluate our method against several baselines on six physical control benchmarks (?; ?), ranging from lowdimensional classical control to challenging highdimensional continuous robotic control tasks. Regarding the second question, we conduct ablation analysis on the quality and amount of demonstrations, respectively. We test and contrast the performances of our method and two representative baselines (Pretrain (?) and POfD (?)) on these different imperfect expert setting. Finally, we explore another ablation analysis on the core component in our method, i.e. soft constraint to address the last question.
5.1 Settings
To simulate the sparse reward conditions using existing control tasks in Gym, we first propose several reward sparsification methods with details as follows^{2}^{2} 2 The presented results are still evaluated in the original exact reward defined in (?).:

•
S1: Agent receives reward $+1$ when it reaches a specific terminal state; otherwise, no reward will be provided.

•
S2: Agent receives reward $+1$ when has already moved towards a certain direction for some distance.

•
S3: Agent receives reward $+1$ when its last pole is higher than a given height. Only applied to DoublePendulum task.
We train expert policies (namely perfect experts, shown as Expert) for each tested tasks with PPO (?) based on the exact reward, and select policies learned meanwhile (namely imperfect experts, shown as Demo), record only one trajectory as the imperfect demonstrations. To make the comparisons fairer, the policies of all the methods and tasks are parameterized by the same neural network architecture with two hidden layers (300 and 400 units each) and tanh activation functions. All the algorithms are evaluated within the fixed amount of environment steps. And for every single task, we run each algorithm over five times with different random seeds.
5.2 Comparative Evaluations
In comparative evaluations, we carry out several RLfD baselines, including Pretrain (?) and POfD (?). In particular, we introduce another two baselines of penalty method^{3}^{3} 3 POfD also belongs to penalty method. with MMD as discrepancy measure, denoted by Penalty and Penalty + Ann., and the later one also employ an annealing strategy described in Sec. 4. We also run two nonRLfD baselines PPO and MMDImitation (denoted as MMDIL) to verify the reward sparsification and the imperfect expert setting respectively. PPO will run with the sparse reward while MMDIL will directly optimize the objective defined in (9) with provided imperfect demonstrations. In Figure 2, the solid curves correspond to the mean reward, and the shaded region represents the variance over five times. The results of our comparative evaluations are summarized in Table 1, which averaged 50 trials under the learned policies.
The results overall read that our method achieves comparable performances with the baselines on relatively simple tasks (such as MountainCar) and outperforms them with a large margin on difficult tasks (such as Hopper, Walker2d and Ant). During policy optimization, our method can converge faster than other RLfD counterparts as well as obtains better final results. Comparing with the strong baseline of Pretrain, we can see that although convergence efficiency of proposed method during the early phase of training may not have significant advantages, but as it continues, the performance of our method can be improved persistently like Hopper(+973.42) and Walker2d(+876.40), while Pretrain struggles on achieving higher return, which demonstrates that our method could benefit more from the exploration guidance offered by the soft constraint during the whole policy optimization procedure than by only imitating at the beginning.
On the other hand, we also find that our algorithm exhibits a more stable and efficient behavior over all the baselines using the penalty method. From the learning curve and numerical results, it can be seen that adopting penalty with imperfect demonstrations will induce a noisy and misleading gradient update, which will prevent the performances from improving further while our method with a soft constraint will not suffer from this. This essentially accounts for the performance gap between our method and all baselines with penalty departures. Moreover, the complex training strategies and auxiliary model in POfD also leads to unstable and inefficient training across different tasks and environment specifications.
From the results of PPO and MMDImitation, the experiment settings of reward sparsification and imperfect demonstrations can be verified. As it illustrates, under sparse environmental feedback, pure PPO fails to find an optimal policy on most of the tested tasks, which indicates the impact of ineffective exploration. While with few imperfect demonstrations, MMDImitation also cannot learn promising policies. It suggests that combining the demonstrations and environmental feedback would be essential for the designated tasks. Furthermore, as similar MMDImitation update may happen in our method when the optimization just starts (mentioned in Sec. 3.3), these results also show how can our method benefit from the followup solving of the constrained optimization problem.
5.3 Ablation Analysis I: Sensitivity to Demonstrations
The results presented in the previous section suggest that our proposed method outperforms other RLfD approaches on several challenging tasks. We’re now interested in whether these advantages still hold when the demonstration setting changes. We will compare our method and baselines on demonstrations with different amounts and quality respectively to show how can they affect the performative results.
Demonstrations with different amounts. We select six groups of demonstrations with different amounts from 50 to 5000 for comparison on the HalfCheetah task. Notice the comparative experiments in Sec.5.2 are conducted with one trajectory with 1000 stateaction pairs as demonstrations. The corresponding results are plotted in the Left of Figure 3. The results read that our method performs advantageously than the baselines on these demonstration settings, and the performance gap is also getting larger as the number of demonstrations increases. On the other hand, the results could benefit from more demonstrations in a certain range for all the methods, while our method can be more robust when the demonstrations become fewer.
Demonstrations with different qualities. We emulate the demonstrations of different qualities by mixing the demonstrated data from perfect (Expert) and imperfect (Demo) policies with different ratios. The Right of Figure 3 presents the results of our method and baselines with these demonstrations. It implies that the quality of demonstrations will significantly affect the performances of all the evaluated methods, and expert data with high quality can facilitate policy optimization to a certain extent. We can also see that our method overall outperforms the two counterparts even though the expert data becomes perfect (by setting the ratio to 1.00), indicating that our constraintbased method can exploit the expert data more efficiently than other methods based on penalty departures or pretraining.
5.4 Ablation Analysis II: Sensitivity to Constraint Tolerance
Now we will further investigate how can the design of the core soft constraint affect the performative results of our method. More specifically, we’re interested in the tolerance factor $d$. By varying the initial value of $d$ and annealing strategies (namely, different annealing factor $\u03f5$), we will explore the sensitivity of our algorithm regarding them.
Different tolerance. We design four groups of parameters for the ablation experiments on the tolerance choosing in HalfCheetah task, where the annealing mechanism is disabled by setting $\u03f5$ fixed at zero, and choose initial tolerance ${d}_{0}$ from $\{{10}^{0},{10}^{1},{10}^{3},{10}^{6}\}$. The learning curves are plotted in Left of Figure 4. As the results demonstrate, when given relatively large tolerance, the exploration reference from demonstrations will not work as the constraint almost does not affect policy optimization. In contrast, a toosmall tolerance will hurt the final performance when the demonstrations are imperfect. Therefore, handcrafting the tolerance for the constraint can be difficult, and an automatic adjustment with the annealing mechanism should be adopted.
Fixed vs. Annealing tolerance. In the previous experiment, we mention the importance of annealing of tolerance. Now we explore the advantages of annealing mechanism quantitatively in HalfCheetah task. Since our annealing is to enlarge the tolerance along training, we only choose two nottoolarge initial tolerances ${d}_{0}$ from $\{{10}^{3},{10}^{6}\}$, and select the annealing factor $\u03f5$ from $\{0,2\times {10}^{3},{10}^{3},{10}^{6}\}$. Corresponding learning curves are shown in Right of Figure 4. We can see that the performances of our method with an annealing tolerance are overall better than with a fixed one (simply by setting $\u03f5$ as zero). Moreover, when the annealing factor $\u03f5$ is set properly, the performance of our method is not sensitive to the minor changes of $\u03f5$ as the results of different factors are almost at the same level. This further demonstrates the robustness of our proposed method.
6 Conclusion
In this paper, we investigate the problem of RLfD with imperfect expert data. Compared to existing RLfD problem setting, this new setting does not require the expert to be optimal, which can be more practical for realworld demonstrators like a human. We show that current penalty based RLfD methods will suffer from the issue of optimality and convergence when being applied to the setting of imperfect experts both theoretically and empirically. To this end, we propose to employs the expert data as a soft constraint and reformulate RLfD as a constrained policy optimization problem to narrow the negative impact of the imperfectness. We also provide an efficient learning algorithm for solving the challenging constrained optimization problem with scalable policy model like neural networks. Experiments on physical control benchmarks demonstrate the effectiveness of our proposed method over other RLfD counterparts. While we still assume the expert data to be collected from the same domain as the current conducted task, further exploration on combining our work with representation learning to enable learning with demonstrations across different domains could be a new direction of future work.
Acknowledgment
This research was funded by National Science Foundation of China (Grant No.91848206). It was also partially supported by the National Science Foundation of China (NSFC) and the German Research Foundation (DFG) in project Cross Modal Learning, NSFC 61621136008/DFG TRR169. We would like to thank Dr. Boqing Gong and Dr. Tao Kong for their generous help and insightful advice.