Abstract
Although reinforcement learning (RL) can provide reliable solutions in manysettings, practitioners are often wary of the discrepancies between the RLsolution and their status quo procedures. Therefore, they may be reluctant toadapt to the novel way of executing tasks proposed by RL. On the other hand,many realworld problems require relatively small adjustments from the statusquo policies to achieve improved performance. Therefore, we propose astudentteacher RL mechanism in which the RL (the "student") learns to maximizeits reward, subject to a constraint that bounds the difference between the RLpolicy and the "teacher" policy. The teacher can be another RL policy (e.g.,trained under a slightly different setting), the status quo policy, or anyother exogenous policy. We formulate this problem using a stochasticoptimization model and solve it using a primaldual policy gradient algorithm.We prove that the policy is asymptotically optimal. However, a naiveimplementation suffers from high variance and convergence to a stochasticoptimal policy. With a few practical adjustments to address these issues, ournumerical experiments confirm the effectiveness of our proposed method inmultiple GridWorld scenarios.
Quick Read (beta)
Don’t Forget Your Teacher:
A Corrective Reinforcement Learning Framework
Abstract
Although reinforcement learning (RL) can provide reliable solutions in many settings, practitioners are often wary of the discrepancies between the RL solution and their status quo procedures. Therefore, they may be reluctant to adapt to the novel way of executing tasks proposed by RL. On the other hand, many realworld problems require relatively small adjustments from the status quo policies to achieve improved performance. Therefore, we propose a studentteacher RL mechanism in which the RL (the “student”) learns to maximize its reward, subject to a constraint that bounds the difference between the RL policy and the “teacher” policy. The teacher can be another RL policy (e.g., trained under a slightly different setting), the status quo policy, or any other exogenous policy. We formulate this problem using a stochastic optimization model and solve it using a primaldual policy gradient algorithm. We prove that the policy is asymptotically optimal. However, a naive implementation suffers from high variance and convergence to a stochastic optimal policy. With a few practical adjustments to address these issues, our numerical experiments confirm the effectiveness of our proposed method in multiple GridWorld scenarios.
shapes,arrows, decorations.markings \definecolormyGreenRGB0,175,0
Don’t Forget Your Teacher:
A Corrective Reinforcement Learning Framework
Mohammadreza Nazari [email protected] Majid Jahani [email protected] Lawrence V. Snyder [email protected] Martin Takáč [email protected] Department of Industrial and Systems Engineering Lehigh University, Bethlehem, PA 18015
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
We encourage using a new paradigm called corrective RL, in which a reinforcement learning (RL) agent is trained to maximize its reward while not straying “too far” from a previously defined policy. The motivation is twofold: (1) to provide a gentler transition for a decisionmaker who is accustomed to using a certain policy but now considers implementing RL, and (2) to develop a framework for gently transitioning from one RL solution to another when the underlying environment has changed.
RL has recently achieved considerable success in artificially created environments, such as Atari games Mnih et al. (2013, 2016) or robotic simulators Lillicrap et al. (2015). Exploiting the power of neural networks in RL algorithms has been shown to exhibit superhuman performance by enabling automatic feature extraction and policy representations, but realworld applications are still very limited, conceivably due to lack of representativity of the optimized policies. Over the past few years, a major portion of the RL literature has been developed for RL agents with no prior information about how to do a task. Typically, these algorithms start with random actions and learn while interacting with the environment through trial and error. However, in many practical settings, prior information about good solutions is available, whether from a previous RL algorithm or a human decisionmaker’s prior experience. Our approach trains the RL agent to make use of this prior information when optimizing, in order to avoid deviating too far from a target policy.
Although toy environments and Atari games are prevalent in the RL literature due to their simplicity, RL has recently been trying to find its path to realworld applications such as recommender systems Chen et al. (2018), transportation Nazari et al. (2018), Internet of Things Feng et al. (2017), supply chain Gijsbrechts et al. (2018); Oroojlooyjadid et al. (2017) and various control tasks in robotics Gu et al. (2017). In all of these applications, there is a crucial risk that the new policy might not operate logically or safely, as one was expecting it to do. A policy that attains a large reward but deviates too much from a known policy—which follows logical steps and processes—is not desirable for these tasks. For example, users of a system who were accustomed to the old way of doing things would likely find it hard to switch to a newly discovered policy, especially if the benefit of the new policy is not obvious or immediately forthcoming. Indeed, we argue that many realworld tasks only need a small corrective fix to the currently running policies to achieve their desired goals, instead of designing everything from scratch. Throughout this paper, we adhere to this paradigm—we call it “corrective RL”—which utilizes an acceptable policy as a gauge when designing novel policies. We consider two agents, namely a teacher and a student. Our main question is how the student can improve upon the teacher’s policy while not deviating too far from it. More formally, we would like to train a student in a way that it maximizes a longterm RL objective while keeping its own policy close to that of the teacher.
For example, consider an airplane that is controlled by an autopilot that follows the shortest haversine path policy towards the destination. Then, some turbulence occurs, and we want to modify the current path to avoid the turbulence. A “pure” RL algorithm would reoptimize the trajectory from scratch, potentially deviating too far from the optimal path in order to avoid the turbulence. Corrective RL would ensure that the adjustments to the current policy are small, so that the flight follows a similar path and has a similar estimated time of arrival, while ensuring that the passengers experience a more comfortable (less turbulent) flight. Another example is in predictive maintenance, where devices are periodically inspected for possible failures. Inspection schedules are usually prescribed by the device designers, but many environmental conditions affect failure rates, hence there is no guarantee that factory schedules are perfect. If the objective is to reduce downtime with only slight adjustments to the current schedules, conventional RL algorithms would have a hard time finding such policies.
Similar concerns arise in other business and engineering domains as well, including supply chain management, queuing systems, finance, and robotics. For example, an enormous number of exact and inexact methods have been proposed for classical inventory management problems under some assumptions on the demand, e.g., that it follows a normal distribution Snyder and Shen (2019). Once we add more stochasticity to the demand distribution or consider more complicated cost functions, these problems often become intractable using classical methods. Of course, vanilla RL can provide a mechanism for solving more complex inventory optimization problems, but practitioners may prefer policies that are simple to implement and intuitive to understand. Corrective RL can help steer the policy toward the preferred ones, while still maintaining nearoptimal performance. Given these examples, one can interpret our approach as an improvement on blackbox heuristics, which uses a datadriven approach to improve the performance of these algorithms without dramatic reformulations.
The contributions of this work are as follows: i) we introduce a new paradigm for RL tasks, convenient for many realworld tasks, that improves upon the currently running system’s policy with a small perturbation of the policy; ii) we formulate this problem using an stochastic optimization problem and propose a primal–dual policy gradient algorithm which we prove to be asymptotically optimal, and iii) using practical adjustments, we illustrate how an RL framework can act as an improvement heuristic. We show the effectiveness and properties of the algorithm in multiple GridWorld motion planning experiments.
2 Problem Definition
We consider the standard definition of a Markov decision process (MDP) using a tuple $(\mathcal{X},\mathcal{A},C,P,{P}_{0})$. In our notation, $\mathcal{X}\u2254{\mathcal{X}}^{\prime}\cup \{{x}_{term}\}=\{1,2,\mathrm{\dots},n,{x}_{term}\}$ is the state space, where ${\mathcal{X}}^{\prime}$ is the set of transient states and ${x}_{term}$ is the terminal state; $\mathcal{A}$ is the set of actions; $C:\mathcal{X}\times \mathcal{A}\to [0,{C}_{max}]$ is the cost function; $P$ is the transition probability distribution; and ${P}_{0}$ is the distribution of the initial state ${x}_{0}$. At each time step $t=0,1,\mathrm{\dots}$, the agent observes ${x}_{t}\in \mathcal{X}$, selects ${a}_{t}\in \mathcal{A}$, and incurs a cost ${c}_{t}=C({x}_{t},{a}_{t})$. Selecting the action ${a}_{t}$ at state ${x}_{t}$ transitions the agent to the next state ${x}_{t+1}\sim P(\cdot {x}_{t},{a}_{t})$.
Consider two agents, a teacher and a student. The teacher has prior knowledge about the task and prescribes an action for any state that the student encounters, and the student has the authority to follow the teacher’s action or act independently based on its own learned policy. Let ${\pi}_{S}$ denote the policy of the student and ${\pi}_{T}$ be the policy of the teacher, where both ${\pi}_{S}$ and ${\pi}_{T}$ are stationary stochastic policies defined as a mapping from a state–action pair to a probability distribution, i.e., ${\pi}_{i}:\mathcal{X}\times \mathcal{A}\to [0,1]$, $i\in \{S,T\}$. For example, ${\pi}_{S}(ax)$ denotes the probability of choosing action $a$ in state $x$ by the student. In policy gradient methods, the policies are represented with a function approximator, usually modeled by a neural network, where we denote by $\theta \in {\mathbb{R}}^{N}$ and $\varphi \in {\mathbb{R}}^{N}$ the corresponding policy weights of the student and teacher, respectively; the teacher and student parameterized policies are denoted by ${\pi}_{T}(\cdot \cdot ;\varphi )$ and ${\pi}_{S}(\cdot \cdot ;\theta )$. In what follows, we adapt this parameterization structure into the notation and interchangeably refer to ${\pi}_{S}$ and ${\pi}_{T}$ with their associated weights, $\theta $ and $\varphi $.
We consider the simulation optimization setting, where we can sample from the underlying MDP and observe the costs. Consider a possible state–action–cost trajectory $\tau \u2254\{{x}_{0},{a}_{0},{c}_{0},{x}_{1},{a}_{1},{c}_{1},\mathrm{\cdots},{x}_{H1},{a}_{H1},{c}_{H1},{x}_{H}\}$ and let $\mathcal{T}\u2254\{\tau \}$ to be the set of all possible trajectories under all admissible policies. For simplicity of exposition, we assume that the first hitting time $H$ of a terminal state ${x}_{term}$ from any given $x$ and following a stationary policy is bounded almost surely with an upper bound ${H}_{max}$, i.e., $H\le {H}_{max}$ almost surely. Since the sample trajectories in many RL tasks terminate in finite time, this assumption is not restrictive. For example, the game fails after reaching a certain state or a timeout signal may terminate the trajectory. Along a trajectory $\tau $, the system incurs a discounted cost ${J}_{\theta}(\tau )={\sum}_{t=0}^{H1}{\gamma}^{t}{c}_{t}$, with discount factor $\gamma \in (0,1]$, and the probability of sampling such a trajectory is ${\mathbb{P}}_{\theta}(\tau )={P}_{0}({x}_{0}){\prod}_{t=0}^{H1}{\pi}_{S}({a}_{t}{x}_{t};\theta )P({x}_{t+1}{x}_{t},{a}_{t})$. We denote the expected cost from state $x$ onward until hitting the terminal state ${x}_{term}$ by ${V}_{\theta}(x)$, i.e.,
${V}_{\theta}(x)={\mathbb{E}}_{\theta}[{\sum}_{t=0}^{H1}{\gamma}^{t}C({x}_{t},{a}_{t}){x}_{0}=x].$  (1) 
2.1 Distance Measure
An important question that arises is how to quantify the distance between the policies of the teacher and the student. There are several distance measures studied in the literature for computing the closeness of two probability measures. Among those, the KullbackLeibler (KL) divergence Nasrabadi (2007) is a widely used metric. In this work, we consider both reverse (KLR) and forward (KLF) KLdivergence, defined as
${D}_{KL}(\theta \parallel \varphi )\u2254{D}_{KL}$  $({\mathbb{P}}_{\theta}(\tau )\parallel {\mathbb{P}}_{\varphi}(\tau ))={\sum}_{\tau \in \mathcal{T}}{\mathbb{P}}_{\theta}(\tau )\mathrm{log}\frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}$  (KLR)  
${D}_{KL}(\varphi \parallel \theta )\u2254{D}_{KL}$  $({\mathbb{P}}_{\varphi}(\tau )\parallel {\mathbb{P}}_{\theta}(\tau ))={\sum}_{\tau \in \mathcal{T}}{\mathbb{P}}_{\varphi}(\tau )\mathrm{log}\frac{{\mathbb{P}}_{\varphi}(\tau )}{{\mathbb{P}}_{\theta}(\tau )}.$  (KLF) 
KLdivergence is known to be an asymmetric distance measure, meaning that changing the order of the student and teacher distributions will cause different learning behaviors. We will use the reverse KLdivergence in the theoretical analysis since it provides more compact notations. However, in all of the experiments, we will consider the forward setting, i.e., ${D}_{KL}(\varphi \parallel \theta )$, unless otherwise specified. Informally speaking, this form of KLdivergence, which is also known as the meanseeking KL, allows the student to perform actions that are not included in the teacher’s behavior. This is because i) when the teacher can perform an action ${a}_{t}$ in a given state ${s}_{t}$, the student should also have ${\pi}_{S}({a}_{t}{s}_{t})>0$ to keep the distance finite, and ii) the student can have ${\pi}_{S}({a}_{t}{s}_{t})>0$ irrespective of whether the teacher is doing that action or not. The reverse direction, ${D}_{KL}(\theta \parallel \varphi )$, known as the modeseeking KL, can be useful as well. For example, let’s assume that the teacher policy is a mixture of several Gaussian subpolicies. Using the reverse order will allow the student to assign only one subpolicy as its decision making policy. Hence, the choice of reverse KL would be preferred if the student wants to find a policy which is close to a teacher’s subpolicy with the highest return. The justification of this behavior is also visible from the definition: when ${\pi}_{S}({a}_{t}{s}_{t})>0$ for a given state and action, then the teacher also should have ${\pi}_{T}({a}_{t}{s}_{t})>0$. Also, ${\pi}_{T}({a}_{t}{s}_{t})=0$ would not allow ${\pi}_{S}({a}_{t}{s}_{t})>0$. For more detailed discussion and examples, we refer the interested reader to Appendix C.1 and Section 10.1.2 of Nasrabadi (2007).
2.2 Optimization Problems
The student’s optimization problems that we would like to solve for reverse KLdivergence (OPTR) and the forward KLdivergence (OPTF) are defined as
$\begin{array}{cc}\hfill \underset{\theta \in \mathrm{\Theta}}{\mathrm{min}}& {V}_{\theta}({x}_{0})\hfill \\ \hfill \text{s.t.}& {D}_{KL}(\theta \parallel \varphi )\le \delta \hfill \end{array}$  (OPTR) 
$\begin{array}{cc}\hfill \underset{\theta \in \mathrm{\Theta}}{\mathrm{min}}& {V}_{\theta}({x}_{0})\hfill \\ \hfill \text{s.t.}& {D}_{KL}(\varphi \parallel \theta )\le \delta ,\hfill \end{array}$  (OPTF) 
where $\delta $ is an upper bound on the KLdivergence and $\mathrm{\Theta}$ is a convex compact set of possible policy parameters. Most of the theoretical analysis of the two optimization problems is quite similar, so we will use (OPTR) as our main formulation. In Appendix B.3, we investigate the equivalence of both problems and state their minor differences.
The widely adopted problem studied for MDPs only contains the objective function; however, we impose an additional constraint to restrict the student’s policy. By fixing an appropriate value for $\delta $, one can enforce a constraint on the maximum allowed deviation of the student policy from that of the teacher. The objective is to find a set of optimal points ${\theta}^{*}$ that minimizes the discounted expected cost while not violating the KL constraint. Notice that ${\pi}_{S}={\pi}_{T}$ is a trivial feasible solution. In addition, we need to have the following assumption to ensure that (OPTR) is welldefined:
Assumption 1.
(Welldefined (OPTR)) For any state–action pair $\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{\in}\mathrm{X}\mathrm{\times}\mathrm{A}$ with ${\mathrm{\pi}}_{\mathrm{T}}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{=}\mathrm{0}$, we have ${\mathrm{\pi}}_{\mathrm{S}}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{=}\mathrm{0}$.
Intuitively, Assumption 1 specifies that when the teacher does not take a specific action in a given state, the student also cannot choose that action. Even though this assumption might seem restrictive, it is valid in situations in which the student is indeed limited to the positiveprobability action space of the teacher. Alternatively, we can certify this assumption by adding a small noise term to the outcome of the teacher’s policy at the expense of some information loss.
2.3 Lagrangian Relaxation of (OPTR)
The standard method for solving (OPTR) is by applying Lagrangian relaxation Bertsekas (1999). We define the Lagrangian function
$L(\theta ,\lambda )\u2254{V}_{\theta}({x}_{0})+\lambda \left({D}_{\text{KL}}(\theta \parallel \varphi )\delta \right),$  (2) 
where $\lambda $ is the Lagrange multiplier. Then the optimization problem (OPTR) can be converted into the following problem:
${\mathrm{max}}_{\lambda \ge 0}{\mathrm{min}}_{\theta \in \mathrm{\Theta}}L(\theta ,\lambda ).$  (3) 
The intuition beyond (3) is that we now allow the student to deviate arbitrarily much from the teacher’s policy in order to decrease the cumulative cost, but we penalize any such deviation.
Next, we define a dynamical system which, as we will prove in Appendix B, solves problem (OPTR) under several common assumptions for stochastic approximation methods. Once we know the optimal Lagrange multiplier ${\lambda}^{*}$, then the student’s optimal policy is
$${\theta}^{*}\in {argmin}_{\theta}{V}_{\theta}({x}_{0})+{\lambda}^{*}\left({D}_{\text{KL}}(\theta \parallel \varphi )\delta \right).$$  (4) 
A point $({\theta}^{*},{\lambda}^{*})$ is a saddle point of $L(\theta ,\lambda )$ if for some $r>0$, we have
$L({\theta}^{*},\lambda )\le L({\theta}^{*},{\lambda}^{*})\le L(\theta ,{\lambda}^{*})$  (5) 
for all $\theta \in \mathrm{\Theta}\cap {\mathbb{B}}_{r}({\theta}^{*})$ and $\lambda \ge 0$, where ${\mathbb{B}}_{r}({\theta}^{*})$ represents a ball around ${\theta}^{*}$ with radius $r$. Then, the saddle point theorem Bertsekas (1999) immediately implies that ${\theta}^{*}$ is the local optimal solution of (OPTR).
3 Primal–Dual Policy Gradient Algorithm
We propose a primal–dual policy gradient (PDPG) algorithm for solving (OPTR). Due to space limitation, we leave the detailed algorithm to Appendix A.2, but the overall scheme is as follows. After initializing the student’s policy parameters, possibly with those of the teacher, we sample multiple trajectories under the student’s policy at each iteration $k$. Then the sampled trajectories are used to calculate the approximate gradient of the Lagrangian function with respect to $\theta $ and $\lambda $. Finally, using an optimization algorithm, we update $\theta $ and $\lambda $ according to the approximated gradients.
In order to prove our main convergence result, we need some technical assumptions on the student’s policy and step sizes.
Assumption 2.
(Smooth policy) For any $\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{\in}\mathrm{X}\mathrm{\times}\mathrm{A}$, ${\mathrm{\pi}}_{\mathrm{S}}\mathrm{}\mathrm{(}\mathrm{a}\mathrm{}\mathrm{x}\mathrm{;}\mathrm{\theta}\mathrm{)}$ is a continuously differentiable function in $\mathrm{\theta}$ and its gradient is $\mathrm{L}$Lipschitz continuous, i.e., for any ${\mathrm{\theta}}^{\mathrm{1}}$ and ${\mathrm{\theta}}^{\mathrm{2}}$,
${\parallel {\nabla}_{\theta}{\pi}_{S}(ax;\theta )}_{\theta ={\theta}^{1}}{\nabla}_{\theta}{\pi}_{S}(ax;\theta ){}_{\theta ={\theta}^{2}}\parallel \le \mathcal{L}\parallel \theta {}^{1}\theta {}^{2}\parallel .$  (6) 
Assumption 3.
Relations 1 and 2 in Assumption 3 are common in stochastic approximation algorithms, and 3 indicates that the Lagrange multiplier update is in a slower timescale compared to the policy updates. The latter condition simplifies the convergence proof by allowing us to study the PDPG as a twotimescale stochastic approximation algorithm. The following theorem states the main theoretical result of this paper.
Theorem 1.
Under Assumptions 1, 2, and 3, the sequence of policy updates (starting from ${\theta}^{\mathrm{0}}$ sufficiently close to a local optimum point ${\theta}^{\mathrm{*}}$) and Lagrange multipliers converges almost surely to a saddle point of the Lagrangian, i.e., $\mathrm{(}\theta \mathrm{(}k\mathrm{)}\mathrm{,}\lambda \mathrm{(}k\mathrm{)}\mathrm{)}\stackrel{a\mathrm{.}s\mathrm{.}}{\mathrm{\u27f6}}\mathrm{(}{\theta}^{\mathrm{*}}\mathrm{,}{\lambda}^{\mathrm{*}}$). Moreover, ${\theta}^{\mathrm{*}}$ is a local optimal solution of (OPTR).
Proof.
(sketch) The proof is similar to those found in Tamar et al. (2012); Chow et al. (2017). It is based on representing $\theta $ and $\lambda $ update rules with a twotimescale stochastic approximation algorithm. For each timescale, the algorithm can be shown to converge to the stationary points of the corresponding continuoustime system. Finally, it can be shown that the fixed point is, in fact, a locally optimal point. In Appendix B.1, we provide a formal proof of this theorem. ∎
Corollary 1.
Under Assumptions 1, 2, and 3, the sequence of policy updates and Lagrange multipliers converges globally to a stationary point of the Lagrangian almost surely. Moreover, if ${\theta}^{\mathrm{*}}$ is in the interior of $\mathrm{\Theta}$, then ${\theta}^{\mathrm{*}}$ is a feasible first order stationary point of (OPTR), i.e., ${{\mathrm{\nabla}}_{\theta}\mathit{}{V}_{\theta}\mathit{}\mathrm{(}{x}_{\mathrm{0}}\mathrm{)}\mathrm{}}_{\theta \mathrm{=}{\theta}^{\mathrm{*}}}\mathrm{=}\mathrm{0}$ and ${D}_{K\mathit{}L}\mathit{}\mathrm{(}{\theta}^{\mathrm{*}}\mathrm{\parallel}\varphi \mathrm{)}\mathrm{\le}\delta $.
4 Practical PDPG Algorithm
Although the algorithm presented in the previous section is proved to converge to a firstorder stationary point, it cannot directly serve as a practical learner algorithm. The main reason is that it produces a highvariance approximation of the gradients, which would lead to unstable learning. In this section, we propose several approximations to the theoreticallyjustified PDPG in order to develop a more practical algorithm. For this algorithm, we will consider the forward definition of KLdivergence due to the meancovering property.
One source of variance is the reward bias, which can be handled by adding a critic, similar to Konda and Tsitsiklis (2000). Our next adjustment is to use an approximation of the stepwise KLdivergence, defined as
${\widehat{D}}_{KL}^{step}(\varphi \parallel \theta )=\frac{1}{N}{\sum}_{j=1}^{N}{\sum}_{t=0}^{H}{D}_{KL}^{step}({\pi}_{T}(\cdot {x}_{t};\varphi )\parallel {\pi}_{S}(\cdot {x}_{t};\theta ))$  (7) 
where ${D}_{KL}^{step}({\pi}_{T}(\cdot {x}_{t};\varphi )\parallel {\pi}_{S}(\cdot {x}_{t};\theta ))={\sum}_{a\in \mathcal{A}}{\pi}_{T}(a{x}_{t};\varphi )\mathrm{log}\frac{{\pi}_{T}(a{x}_{t};\varphi )}{{\pi}_{S}(a{x}_{t};\theta )}.$ As we discuss in Appendix C.1, using (7) results in a much smaller variance, while still ensuring the convergence results. Intuitively, this equation suggests that instead of computing the trajectory probabilities and then computing the KLdivergence, as in (KLR), one can compute the KL in every visited state along a trajectory and sum them up. In addition to this change, we will further normalize each ${D}_{KL}^{step}$ by its trajectory length $H$ to remove the effect of the variable horizon length. The latter modification will lead to more sensible KL values and will make the choice of $\delta $ easier.
A second difficulty with the algorithm in Section 3 is that, unlike conventional policy gradient algorithms, there is no guarantee that the student’s optimal policy is a deterministic one. In fact, in most of our experiments, it happens that the optimal policy is stochastic, especially when the teacher’s policy itself is stochastic. To illustrate this, consider two scenarios: i) The student refuses to do the suggested action of a deterministic teacher. In this case, she would incur an infinite cost as a result of her disobedience, so the problem will be infeasible. ii) The teacher is less informative and has no clue about most of the state space, so often takes random actions. Trying to emulate this teacher would cause degraded performance for the student as well, so the student would also take many less informed actions.
A stochastic optimal policy is usually not desirable since it poses major safety and reliability challenges, so our next adjustments are an attempt to address this issue. One possible mitigation for the first scenario might be using a bounded distance measure such as Hellinger Cramer (1946) instead of KLdivergence, but our numerical experiments did not confirm that this is effective. We observe that by using the Hellinger constraint, the total entropy of the student’s policy stays high, without any improvement in the student’s policy. Instead, we propose using percentile KLclipping, which is defined as
${\mathrm{\text{clip}}}_{\rho}\left({D}_{KL}^{step}\right)=\mathrm{max}\{\rho \text{\% percentile of all}{D}_{KL}^{step}\text{s at time}t,{D}_{KL}^{step}\}$  (8) 
In fact, the clip function enables the student to totally disagree with the teacher in $\rho $% of the visited states, without receiving an extremely large penalty. Selecting the values for $\rho $ depends on our perception about how perfect the teacher is. Setting $\rho $ close to 100 means that we believe in the teacher’s suggestions. As we decrease $\rho $, we rely less on the teacher and can disobey more freely.
The last major modification is to control the expected entropy at a certain small level ${\delta}^{ent}>0$, i.e.,
$ent(\theta )\u2254{\mathbb{E}}_{x}\left[\frac{1}{H}{\sum}_{t=0}^{H}{\sum}_{a\in \mathcal{A}}{\pi}_{S}(ax;\theta )\mathrm{log}{\pi}_{S}(ax;\theta )\right]={\delta}^{ent}.$  (9) 
The justification for adding (9) is that we would like the optimal policy to be close to a deterministic one as much as possible. By setting a small value for ${\delta}^{ent}$, we can enforce this property. Also, this constraint tries to avoid having a deterministic policy in intermediate training steps, in order to allow more exploration. To add this constraint, we use the same Lagrangian technique, adding an extra term to the Lagrangian function:
$L(\theta ,\lambda ,\zeta )\u2254{V}_{\theta}({x}_{0})+\lambda \left({D}_{\text{KL}}(\varphi \parallel \theta )\delta \right)+\zeta \left(ent(\theta ){\delta}^{ent}\right).$  (10) 
All of these modifications, along with a few others, are summarized in Algorithm LABEL:alg:ppg of Appendix LABEL:app:ppg.
5 Experiments
We illustrate the efficiency of the proposed methods with multiple GridWorld experiments. In the first set of experiments, the teacher tries to teach the student to perform an oscillating maneuver around the walls. In the second set, we study how the student can comprehend changes in the environment change and utilize them to increase its rewards.
5.1 SquareWave Teacher
In this experiment, we consider a teacher who gives a suggestion in every state of a GridWorld. We study two variants of the teacher, one who is very determined about all of his suggestions and the other who is less confident. Figure 1 illustrates the environment and both teachers’ suggestions. A student wants to find a path from the blue state to the green target. Each step has a reward $1$ and reaching the target brings $+100$ reward. If the student wants to act independently, the optimal path is a trivial horizontal line. However, our objective is to force the student to “listen” to her teacher up to some level.
Determined Teacher: In this part, a teacher has a preferred action in every state with a probability of around 98%. As we observe in Figure 0(a), these suggestions might help the student in reaching the target, but they are inefficient. For instance, if the student follows a squarewave sequence of actions as illustrated by the red line, she will be able to reach the target while following all of the teacher’s suggestions exactly. Our objective is to allow the student deviate from the teacher for a few steps, to find shorter routes.
By using PDPG, the student is able to find policies that are a mixture of the horizontal path and the squarewave route. For example, in Figure 0(c), we have illustrated an instance of the student’s optimized path with $\delta =0.2$ . The extent to which either policy is followed depends on values of $\delta $ and the KLclipping parameter $\rho $. Figure 1(a) illustrates the student’s total reward for different $\delta $ quantities without KLclipping. We observe that as we increase $\delta $, we allow the student to act more freely, hence she gets a higher reward. However, after 5000 training iterations, the reward remains at the same level with too much oscillation. Recalling the discussions of Section 4, this behavior indicates convergence to a stochastic policy.
To reduce the oscillating behavior, we proposed adding an entropy constraint and KLclipping. Figure 1(b) shows how adding the entropy constraint results in a more deterministic (i.e., lower entropy) policy. Also, in Figure 1(c), we have added KLclipping. As we decrease $\rho $, the student can totally disagree with the teacher in a larger proportion of the visited states, so she can find better policies with higher rewards. For different values of $\rho $, we see that the policy can converge to either a stochastic or deterministic one. For $\rho =70,75$, it converges to a deterministic horizontal line policy. With $\rho =80,85$, it learns to deterministically follow one $\bigsqcup $shaped path followed by a horizontal route, and for $\rho =95$, it follows a $\bigsqcup \sqcap \mathit{\hspace{1em}}\bigsqcup $shaped path with a horizontal line at the end. Notice that even with $\rho =100$, which means no clipping, the student is not exactly following the teacher. We also observe that for $\rho =90$, it fails to converge to a deterministic policy. One justification for such a failure is that the student’s policy is far better than the lessrewarding deterministic one, but not good enough to get to the next level of performance. Finally, Figure 1(d) shows how the $\lambda $ and $\zeta $ values converge to their optimal values.
Less Confident Teacher: This experiment is designed to illustrate how a less confident teacher can still teach the student to follow some of his suggestions, but it will yield a lower level of confidence of the student. Figure 0(b) shows the suggested actions of the teacher; he is deterministic only in a subset of the states. For the rest, he does not have any information, so he suggests actions uniformly at random. The less confident teacher still has the square wave as the general idea (which is bad, just like the determined teacher), but also has extra randomness that points the student in even worse directions. In other words, the less confident teacher has a worse policy overall than the determined one. Recommending random actions causes the student to have more volatile behavior. We can observe this fact by comparing Figure 2(a) with 1(a), where the student’s converged policy produces a wider range of rewards for the lessconfident teacher’s case. Also, the average reward for this case is slightly lower, which can be explained by the inadequate information that the lessconfident teacher provides for solving the task.
Figure 2(b) shows that adding KLclipping helps in reducing the volatility, but one needs to choose a much smaller value for $\rho $ (compare it with Figure 1(c)). Yet, even a small $\rho $ does not necessarily result in a deterministic policy; for $\rho $ as small as $0.4$, the student has converged to a stochastic one.
5.2 Wall Leaping
The purpose of this experiment is to show that PDPG can act as an improvement method when the student encounters a slightly modified environment. The teacher’s reward structure is similar to the structure in the previous experiment, i.e., $1$ for every step and $+100$ for reaching the target. However, the student comprehends that she can leap over some of the walls with a reward of $2$. We use a vanilla policy gradient algorithm to train the teacher, which provides paths like the one illustrated in Figure 3(a). If we allow the student to learn without any constraint, it will find the green path in Figure 4 with a KLdivergence of $\approx 0.89$. However, this is not what we are looking for since it is extremely different from the teacher. Instead, we use the PDPG algorithm to constrain the policy deviation with $\delta =0.3$. Using this parameter, the student learns to follow the purple path, with a KLdivergence of $\approx 0.23$.
6 Related Work
Learning from a teacher is a wellstudied problem in the literature on supervised learning Girshick et al. (2014) and imitation learning Schaal (1999); Thomaz et al. (2006). However, we are not aware of any work using a teacher to control specific behaviors of a student. The typical use case of a student–teacher framework in RL is in “policy compression,” where the objective is to train a student from a collection of welltrained RL policies. Policy distillation Rusu et al. (2015) and actor–mimic Parisotto et al. (2015) are two methods that distill the trained RL agents, in a supervised learning fashion, into a unified policy of the student. In contrast, we follow a completely distinct objective, where a student is continually interacting with an environment and it only uses the teacher’s signals as a guideline for shaping her policy.
Closest to ours, Schmitt et al. (2018) propose “kickstarting RL,” a method that uses the teacher’s information for better training. Incorporating the idea of populationbased training, they design a handcrafted decreasing schedule of Lagrange multipliers, $\{{\lambda}^{k}\}\to 0$. Nevertheless, the justification for such a schedule is not clearly visible. However, noticing that their problem is a special case of ours with $\delta =\mathrm{\infty}$, our findings confirm the credibility of their approach, i.e., our findings indicate that ${\lambda}^{*}={lim}_{k\to \mathrm{\infty}}{\lambda}_{min}^{k}=0$ according to strong duality. This observation also conforms with the experimental findings of Schmitt et al. (2018), and our theoretical results indicate that when there is no obligation on being similar to the teacher, the student is better off eventually operating independently. Similarly, their method only uses the teacher for faster learning.
Imposing certain constraints on the behavior of a policy is also a common problem in the context of “safe RL” Achiam et al. (2017); Leike et al. (2017); Chow et al. (2018). Typically, these problems look for policies that avoid hazardous states either during training or execution. Our problem is different in that we follow another type of constraint, yet similar methods might be applied. Using a domainspecific programming language instead of neural networks can be an alternative method to add interpretability Verma et al. (2018), but it lacks the numerous advantages inherent in endtoend and differentiable learning. In an alternative direction, it is also possible to manipulate the policy shape by introducing auxiliary tasks or reward shaping Jaderberg et al. (2016). Despite the simplicity of the latter approach, it has a very limited capability. For example, it is unclear how reward shaping can suggest directions similar to our squarewave teacher. In summary, we believe that our endtoend method, by implicitly adding interpretable components, can partially alleviate the concerns related to the RL policies.
7 Concluding Remarks
In this paper, we introduce a new paradigm called corrective RL, which allows a “student” agent to learn to optimize its own policy while also staying sufficiently close to the policy of a “teacher.” Our approach is motivated by the fact that practitioners may be reluctant to adopt the policies proposed by RL algorithms if they differ too much from the status quo. Even if the RL policy produces an impressive expected return, this may not be satisfactory evidence to switch the operation of a billiondollar company to a policy found by an RL. We believe that corrective RL provides a straightforward remedy by constraining how far the new policy can deviate from the old one or another desired, target policy. Doing so will help reduce the stresses of adopting a novel policy.
We believe that, with further extensions, corrective RL has the potential to address to some of RL’s interpretability challenges. Using more advanced optimization algorithms, studying different distance measures, considering continuousaction problems, and having multiple teachers represent fruitful avenues for future research.
Acknowledgement
This work was partially supported by the U.S. National Science Foundation, under award numbers NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796, and XSEDE IRI180020.
References
 Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Chen et al. [2018] ShiYong Chen, Yang Yu, Qing Da, Jun Tan, HaiKuan Huang, and HaiHong Tang. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1187–1196. ACM, 2018.
 Nazari et al. [2018] MohammadReza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takac. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pages 9861–9871, 2018.
 Feng et al. [2017] Shuo Feng, Peyman Setoodeh, and Simon Haykin. Smart home: Cognitive interactive peoplecentric internet of things. IEEE Communications Magazine, 55(2):34–39, 2017.
 Gijsbrechts et al. [2018] Joren Gijsbrechts, Robert N Boute, Jan A Van Mieghem, and Dennis Zhang. Can deep reinforcement learning improve inventory management? performance and implementation of dual sourcingmode problems. Performance and Implementation of Dual SourcingMode Problems (December 17, 2018), 2018.
 Oroojlooyjadid et al. [2017] Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence Snyder, and Martin Takáč. A deep qnetwork for the beer game with partial information. arXiv preprint arXiv:1708.05924, 2017.
 Gu et al. [2017] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates. In IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396, 2017.
 Snyder and Shen [2019] Lawrence V Snyder and ZuoJun Max Shen. Fundamentals of supply chain theory, 2th Edition. John Wiley & Sons, 2019.
 Nasrabadi [2007] Nasser M Nasrabadi. Pattern recognition and machine learning. Journal of Electronic Imaging, 16(4):049901, 2007.
 Bertsekas [1999] Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 Tamar et al. [2012] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related risk criteria. In Proceedings of the twentyninth International Conference on Machine Learning, pages 387–396, 2012.
 Chow et al. [2017] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Riskconstrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–167, 2017.
 Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actorcritic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
 Cramer [1946] Harold Cramer. Mathematical methods of statistics, princeton univ. Press, Princeton, NJ, 1946.
 Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
 Schaal [1999] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
 Thomaz et al. [2006] Andrea Lockerd Thomaz, Cynthia Breazeal, et al. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In Aaai, volume 6, pages 1000–1005. Boston, MA, 2006.
 Rusu et al. [2015] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
 Parisotto et al. [2015] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actormimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
 Schmitt et al. [2018] Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835, 2018.
 Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 22–31. JMLR. org, 2017.
 Leike et al. [2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
 Chow et al. [2018] Yinlam Chow, Ofir Nachum, Edgar DuenezGuzman, and Mohammad Ghavamzadeh. A lyapunovbased approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
 Verma et al. [2018] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.
 Jaderberg et al. [2016] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Bhatnagar et al. [2009] Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actorcritic algorithms. Automatica, 45(11), 2009.
 Borkar [2009] Vivek S Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
 Bertsekas [2009] Dimitri P Bertsekas. Convex optimization theory. Athena Scientific Belmont, 2009.
 Slotine et al. [1991] JeanJacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume 199. Prentice hall Englewood Cliffs, NJ, 1991.
 Teh et al. [2017] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
 Ghosh et al. [2017] Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divideandconquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
 Liu et al. [2018] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinitehorizon offpolicy estimation. In Advances in Neural Information Processing Systems, pages 5361–5371, 2018.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
Appendix A PDPG Algorithms
A.1 Computing the Gradients
The Lagrangian function in the optimization problem (3) can be rewritten as
$L(\theta ,\lambda )=$  $\sum _{\tau \in \mathcal{T}}}{\mathbb{P}}_{\theta}(\tau )J(\tau )+\lambda {\displaystyle \sum _{\tau \in \mathcal{T}}}{\mathbb{P}}_{\theta}(\tau )\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}}\lambda \delta $  
$=$  $\sum _{\tau \in \mathcal{T}}}{\mathbb{P}}_{\theta}(\tau )\left(J(\tau )+\lambda \mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}}\right)\lambda \delta .$  (11) 
Recall that $\mathcal{T}$ is the set of all trajectories under all admissible policies. By taking the gradient of $L(\theta ,\lambda )$ with respect to $\theta $, we have:
${\nabla}_{\theta}L(\theta ,\lambda )$  $={\displaystyle \sum _{\tau \in \mathcal{T}}}{\nabla}_{\theta}{\mathbb{P}}_{\theta}(\tau )\left(J(\tau )+\lambda \mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}}\right)+{\mathbb{P}}_{\theta}(\tau )\left(\lambda {\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )\right)$  
$={\displaystyle \sum _{\tau \in \mathcal{T}}}{\mathbb{P}}_{\theta}(\tau ){\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )\left(J(\tau )+\lambda \mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}}+\lambda \right)$  
$={\mathbb{E}}_{\mathcal{T}}\left[{\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )\left(J(\tau )+\lambda \mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}}+\lambda \right)\right],$  (12) 
and the term ${\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )$ can be simplified as
${\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )$  $={\nabla}_{\theta}\left(\mathrm{log}{P}_{0}({x}_{0})+{\displaystyle \sum _{t=0}^{H1}}\mathrm{log}P({x}_{t+1}{x}_{t},{a}_{t})+{\displaystyle \sum _{t=0}^{H1}}\mathrm{log}{\pi}_{S}({a}_{t}{x}_{t};\theta )\right)$  
$={\displaystyle \sum _{t=0}^{H1}}{\nabla}_{\theta}\mathrm{log}{\pi}_{S}({a}_{t}{x}_{t};\theta )$  
$={\displaystyle \sum _{t=0}^{H1}}{\displaystyle \frac{{\nabla}_{\theta}{\pi}_{S}({a}_{t}{x}_{t};\theta )}{{\pi}_{S}({a}_{t}{x}_{t};\theta )}}.$  (13) 
The gradient of $L(\theta ,\lambda )$ with respect to $\lambda $ is
${\nabla}_{\lambda}L(\theta ,\lambda )$  $={\displaystyle \sum _{\tau \in \mathcal{T}}}{\mathbb{P}}_{\theta}(\tau )\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}}\delta ={D}_{\text{KL}}({\mathbb{P}}_{\theta}(\tau )\parallel {\mathbb{P}}_{\varphi}(\tau ))\delta .$  (14) 
By using a set of sample trajectories $\{{\tau}_{j},j=1,\mathrm{\dots},N\}$ generated under the student policy, one can approximate the gradients (12) and (14) as
${\nabla}_{\theta}L(\theta ,\lambda )$  $\approx {\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}\left[{\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}({\tau}_{j})\left(J({\tau}_{j})+\lambda \mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}({\tau}_{j})}{{\mathbb{P}}_{\varphi}({\tau}_{j})}}+\lambda \right)\right],$  
${\nabla}_{\lambda}L(\theta ,\lambda )$  $\approx {\widehat{D}}_{\text{KL}}(\theta \parallel \varphi )\delta ={\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}(\tau )}{{\mathbb{P}}_{\varphi}(\tau )}}\delta ,$ 
which are the update rules that will be used later on, in (15) and (16).
A.2 PDPG Algorithm
Having derived the gradients of the Lagrangian (in Appendix A.1), we have all the necessary information for proposing our primal–dual policy gradient (PDPG) algorithm, which is described in Algorithm A.2.
[htbp] \[email protected]@algorithmic[1] \STATEinput: teacher’s policy with weights $\varphi $ \STATEinitialize: student’s policy with ${\theta}^{0}$, possibly equal to $\varphi $; initialize step size schedules ${\alpha}_{1}(\cdot )$ and ${\alpha}_{2}(\cdot )$ \WHILETRUE \FOR$k=0,1,\mathrm{\dots}$ \STATEfollowing policy ${\theta}^{k}$, generate a set of $N$ trajectories ${\mathcal{T}}^{k}=\{{\tau}_{j}^{k},j=1,2,\mathrm{\dots},N\}$, each starting from an initial state ${x}_{0}\sim {P}_{0}(\cdot )$ \STATE($\theta $update) update ${\theta}^{k}$ according to
${\theta}^{k+1}={\mathrm{\Gamma}}_{\mathrm{\Theta}}\left[{\theta}^{k}{\alpha}_{1}(k)\left({{\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}{\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}_{\theta ={\theta}^{k}}\left(J({\tau}_{j}^{k})+{\lambda}^{k}\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}}+{\lambda}^{k}\right)\right)\right]$  (15) 
($\lambda $update) update ${\lambda}^{k}$ according to
${\lambda}^{k+1}={\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{k}+{\alpha}_{2}(k)\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}}\delta \right)\right]$  (16) 
${\lambda}^{k}$ converges to ${\lambda}_{max}$ \STATE${\lambda}_{max}\leftarrow 2{\lambda}_{max}$ \ELSE \STATEreturn $\theta $ and $\lambda $; break
After initializing the student with the teacher’s policy, at each iteration $k$, we take a minibatch of sample trajectories under the student’s policy ${\theta}^{k}$. In step A.2, we use the sampled trajectories to compute an approximate gradient of the Lagrangian function with respect to $\theta $ and update the policy parameters in the negative direction of the approximate gradient with step size ${\alpha}_{1}(k)$. In addition to policy parameter updates, the dual variables are learned concurrently using the recursive formula
${\lambda}^{k+1}={\lambda}^{k}+{\alpha}_{2}(k)\left[{\widehat{D}}_{\text{KL}}(\theta \parallel \varphi )\delta \right],$  (17) 
where ${\alpha}_{2}(k)$ represents the associated stepsize rule.
In this algorithm, we need to use two projection operators to ensure the convergence of the algorithm. Specifically, ${\mathrm{\Gamma}}_{\mathrm{\Theta}}$ is an operator that projects $\theta $ to the closest point in $\mathrm{\Theta}$, i.e., ${\mathrm{\Gamma}}_{\mathrm{\Theta}}(\theta )={argmin}_{\widehat{\theta}\in \mathrm{\Theta}}{\parallel \theta \widehat{\theta}\parallel}^{2}$. Similarly, ${\mathrm{\Gamma}}_{\mathrm{\Lambda}}$ is an operator that maps $\lambda $ to the interval $\mathrm{\Lambda}\u2254[0,{\lambda}_{max}]$. Finally, in steps A.2–A.2, we check whether $\lambda $ has converged to some point on the boundary. Such a convergence means that the projection space for the Lagrange multipliers is small, so we increment the upper bound and repeat searching for a better policy.
Appendix B Convergence Analysis of PDPG for (OPTR)
Before starting the proof of Theorem 1, noting the definition of ${\nabla}_{\theta}L(\theta ,\lambda )$ and ${\nabla}_{\lambda}L(\theta ,\lambda )$, one can make the following observations:
Lemma 1.
Under Assumption 2, the following holds:

1.
${\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )$ is Lipschitz continuous in $\theta $, which further implies that
${\parallel {\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )\parallel}^{2}\le {\kappa}_{1}(\tau )\left(1+{\parallel \theta \parallel}^{2}\right)$ (18) for some $$.

2.
${\nabla}_{\theta}L(\theta ,\lambda )$ is Lipschitz continuous in $\theta $, which further implies that
${\parallel {\nabla}_{\theta}L(\theta ,\lambda )\parallel}^{2}\le {\kappa}_{2}\left(1+{\parallel \theta \parallel}^{2}\right)$ (19) for some constant $$.

3.
${\nabla}_{\lambda}L(\theta ,\lambda )$ is Lipschitz continuous in $\lambda $.
Proof.
Recall from (13) that ${\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )={\sum}_{t=0}^{H1}{\nabla}_{\theta}{\pi}_{S}({a}_{t}{x}_{t};\theta )/{\pi}_{S}({a}_{t}{x}_{t};\theta )$ whenever ${\pi}_{S}({a}_{t}{x}_{t};\theta )>\psi $ for all $t$ and for some $\psi >0$. Assumption 2 indicates that ${\nabla}_{\theta}{\pi}_{S}({a}_{t}{x}_{t};\theta )$ is $\mathcal{L}$Lispchitz continuous in $\theta $. Then using the fact the sum of the product of (bounded) Lipschitz functions is Lipschitz itself, one can conclude the Lipschitz continuity of ${\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )$, and we denote by ${L}_{1}$ its finite Lipschitz constant. Also, noting that $$ w.p. 1, then $$ w.p. 1. The Lipschitz continuity implies that for any fixed ${\theta}_{0}\in \mathrm{\Theta}$,
$\parallel {\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau )\parallel \le \parallel {\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}(\tau ){}_{\theta ={\theta}_{0}}\parallel +{L}_{1}\parallel \theta {\theta}_{0}\parallel \le {K}_{1}(\tau )(1+\parallel \theta \parallel ).$  (20) 
The first inequality follows from the linear growth condition of Lipschitz functions and the last one holds for a suitable value of $$. Taking the square of both sides of (20) yields (18) with $$.
Since ${\mathbb{P}}_{\theta}(\tau )$ and $\mathrm{log}{\mathbb{P}}_{\theta}(\tau )$ are continuously differentiable in $\theta $ whenever ${\mathbb{P}}_{\theta}(\tau )>0$, the Lipschitz continuity of ${\nabla}_{\theta}L(\theta ,\lambda )$ can be investigated, from its definition (12), as the sums of products of (bounded) Lipschitz functions. From the definition (12), and recalling Assumption 1 and the compactness of $\mathrm{\Theta}$, one can verify the validity of (19) with
$$  (21) 
Finally, 3 immediately follows from the fact that ${\nabla}_{\lambda}L(\theta ,\lambda )$ is a constant function of $\lambda $. ∎
B.1 Convergence of PDPG Algorithm
We use the standard procedure for proving the convergence of the PDPG algorithm. The proof steps are common for stochastic approximation methods and we refer the reader to Chow et al. [2017], Bhatnagar et al. [2009] and references therein for more details. We summarize the scheme of the proof in the following steps:

1.
Tracking o.d.e.: Under Assumption 3, one can view the PDPG as a twotimescale stochastic approximation method. Then, using the results of Section 6 of Borkar [2009], we show that the sequence of $({\theta}^{k},{\lambda}^{k})$ converges almost surely to a stationary point $({\theta}^{*},{\lambda}^{*})$ of the corresponding continuoustime dynamical system.

2.
Lyapunov Stability: By using Lyapunov analysis, we show that the continuoustime system is locally asymptotically stable at a firstorder stationary point.

3.
Saddle Point Analysis: Since we have used the Lagrangian as the Lyapunov function, it implies the system is stable in the stationary point of the Lagrangian, which is, in fact, a local saddle point. Finally, we show that with an appropriate initial policy, the policy converges to a local optimal solution ${\theta}^{*}$ for the OPTR.
First, let us denote by ${\mathrm{\Psi}}_{\mathrm{\Xi}}\left[f(\xi )\right]$ the right directional derivative of ${\mathrm{\Gamma}}_{\mathrm{\Xi}}(\xi )$ in the direction of $f(\xi )$, defined as
${\mathrm{\Psi}}_{\mathrm{\Xi}}\left[f(\xi )\right]\u2254\underset{\alpha \downarrow 0}{lim}{\displaystyle \frac{{\mathrm{\Gamma}}_{\mathrm{\Xi}}\left[\xi +\alpha f(\xi )\right]{\mathrm{\Gamma}}_{\mathrm{\Xi}}\left[\xi \right]}{\alpha}}$ 
for any compact set $\mathrm{\Xi}$ and $\xi \in \mathrm{\Xi}$.
Since $\theta $ converges on a faster timescale than $\lambda $ by Assumption 3, one can write the $\theta $update rule (15) with a relation that is invariant to $\lambda $:
$${\theta}^{k+1}={\mathrm{\Gamma}}_{\mathrm{\Theta}}\left[{\theta}^{k}{\alpha}_{1}(k)\left({\frac{1}{N}\sum _{j=1}^{N}{\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}_{\theta ={\theta}^{k}}\left(J({\tau}_{j}^{k})+\lambda \mathrm{log}\frac{{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}+\lambda \right)\right)\right].$$ 
Consider the continuoustime dynamics of $\theta \in \mathrm{\Theta}$ defined as
$\dot{\theta}={\mathrm{\Psi}}_{\mathrm{\Theta}}\left[{\nabla}_{\theta}L(\theta ,\lambda )\right],$  (22) 
where by using the right directional derivative ${\mathrm{\Psi}}_{\mathrm{\Theta}}\left[{\nabla}_{\theta}L(\theta ,\lambda )\right]$ in the gradient descent algorithm for $\theta $, the gradient will point in the descent direction of $L(\theta ,\lambda )$ along the boundary of $\mathrm{\Theta}$ (denoted by $\partial \mathrm{\Theta}$) whenever the $\theta $update hits the boundary. We refer the interested reader to Section 5.4 of Borkar [2009] for discussions about the existence of the limit in (22).
Since $\lambda $ converges in the slowest timescale, the $\lambda $update rule (16) can be rewritten for a converged value ${\theta}^{*}(\lambda )$ as
${\lambda}^{k+1}={\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{k}+{\alpha}_{2}(k)\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{{\theta}^{*}(\lambda )}({\tau}_{j}^{k})}{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}}\delta \right)\right].$ 
Consider the continuoustime dynamics corresponding to $\lambda $, i.e.
$\dot{\lambda}={\mathrm{\Psi}}_{\mathrm{\Lambda}}\left[{\nabla}_{\lambda}L(\theta ,\lambda )\right],$  (23) 
where by using ${\mathrm{\Psi}}_{\mathrm{\Lambda}}\left[{\nabla}_{\lambda}L(\theta ,\lambda )\right]$ in the gradient ascent algorithm, the gradient will point in the ascent direction along the boundary of $\mathrm{\Lambda}$ (denoted by $\partial \mathrm{\Lambda}$) whenever the $\lambda $update hits the boundary.
We prove Theorem 1 next.
Proof.
Convergence of the $\theta $update: First, we need to show that the assumptions of Lemma 1 in Chapter 6 of Borkar [2009] hold for the $\theta $update and an arbitrary value of $\lambda $. Let us justify these assumptions: (i) the Lipschitz continuity follows from Lemma 1, and (ii) the stepsize rules follow from Assumption 3. (iii) For an arbitrary value $\lambda $, one can write the $\theta $update as a stochastic approximation, i.e.,
${\theta}^{k+1}={\mathrm{\Gamma}}_{\mathrm{\Theta}}\left[{\theta}^{k}+{\alpha}_{1}(k)\left({{\nabla}_{\theta}L(\theta ,\lambda )}_{\theta ={\theta}^{k}}+{M}_{{\theta}_{k+1}}\right)\right],$  (24) 
where
${M}_{{\theta}^{k+1}}={{\nabla}_{\theta}L(\theta ,\lambda )}_{\theta ={\theta}^{k}}{{\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}{\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}_{\theta ={\theta}^{k}}\left(J({\tau}_{j}^{k})+{\lambda}^{k}\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}}+{\lambda}^{k}\right).$  (25) 
For ${M}_{{\theta}^{k+1}}$ to be a Martingale difference error term, we need to show that its expectation with respect to the filtration ${\mathcal{F}}_{\theta}^{k}=\sigma ({\theta}^{m},{M}_{{\theta}^{m}},m\le k)$ is zero and that it is square integrable with $\mathbb{E}\left[{\parallel {M}_{{\theta}^{k+1}}\parallel}^{2}{\mathcal{F}}_{\theta}^{k}\right]\le {\kappa}^{k}(1+{\parallel {\theta}^{k}\parallel}^{2})$ for some ${\kappa}^{k}$. Since the trajectories ${\mathcal{T}}^{k}$ are generated from the probability mass function ${\mathbb{P}}_{{\theta}^{k}}(\cdot )$, it immediately follows that $\mathbb{E}\left[{M}_{{\theta}^{k+1}}{\mathcal{F}}_{\theta}^{k}\right]=0$. Also, we have:
${\parallel {M}_{{\theta}^{k+1}}\parallel}^{2}$  
$\le 2\parallel {\nabla}_{\theta}L(\theta ,\lambda ){}_{\theta ={\theta}^{k}}{\parallel}^{2}+{\displaystyle \frac{2}{{N}^{2}}}{({\displaystyle \frac{{C}_{max}}{1\gamma}}+{\lambda}_{max}({D}_{max}^{k}+1))}^{2}\parallel {\displaystyle \sum _{j=1}^{N}}{\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}({\tau}_{j}^{k}){}_{\theta ={\theta}^{k}}\parallel {}^{2}$  
$\le 2{\kappa}_{2}^{k}\left(1+{\parallel {\theta}^{k}\parallel}^{2}\right)+{\displaystyle \frac{{2}^{N}}{{N}^{2}}}{\left({\displaystyle \frac{{C}_{max}}{1\gamma}}+{\lambda}_{max}\left({D}_{max}^{k}+1\right)\right)}^{2}\left({\displaystyle \sum _{j=1}^{N}}{\kappa}_{1}^{k}({\tau}_{j}^{k})\left(1+{\parallel {\theta}^{k}\parallel}^{2}\right)\right)$  
$\le {\kappa}^{k}\left(1+{\parallel {\theta}^{k}\parallel}^{2}\right),$ 
where
${D}_{max}^{k}$  $=\underset{1\le j\le N}{\mathrm{max}}\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}},\text{and}$  
${\kappa}^{k}$  $$ 
The first and second inequality uses the relation ${\parallel {\sum}_{i=1}^{N}{a}_{i}\parallel}^{2}\le {2}^{N1}({\sum}_{i=1}^{N}{\parallel a\parallel}^{2})$. Also, the second one uses the results of Lemma 1. Finally, the boundedness of ${\kappa}^{k}$ follows from Assumption 1 and having $$, $$ w.p. 1. Finally, (iv) $$ almost surely, because all ${\theta}^{k}$ are within the compact set $\mathrm{\Theta}$. Hence, by Theorem 2 of Chapter 2 in Borkar [2009], the sequence $\{{\theta}^{k}\}$ converges almost surely to a (possibly sample path dependent) internally chain transitive invariant set of o.d.e. (22).
For a given $\lambda $, define the Lyapunov function
${\mathcal{L}}_{\lambda}(\theta )=L(\theta ,\lambda )L({\theta}^{*},\lambda ),$  (26) 
where ${\theta}^{*}\in \mathrm{\Theta}$ is a local minimum point. For the sake of simplifying the proof, let us consider that ${\theta}^{*}$ is an isolated local minimum point, i.e., there exists $r$ such that for all $\theta \in {\mathbb{B}}_{r}({\theta}^{*})$, ${\mathcal{L}}_{\lambda}(\theta )>{\mathcal{L}}_{\lambda}({\theta}^{*})$. This means that the Lyapunov function ${\mathcal{L}}_{\lambda}(\theta )$ is locally positive definite, i.e., ${\mathcal{L}}_{\lambda}({\theta}^{*})=0$ and ${\mathcal{L}}_{\lambda}(\theta )>0$ for ${\mathbb{B}}_{r}\setminus \{{\theta}^{*}\}$.
If we establish the negative semidefiniteness of $d{\mathcal{L}}_{\lambda}(\theta )/dt\le 0$, then we can use the Lyapunov stability theorems to show the convergence of the dynamical system. Consider the time derivative of corresponding continuoustime system for $\theta $, i.e.,
$\frac{d{\mathcal{L}}_{\lambda}(\theta )}{dt}}={\displaystyle \frac{dL(\theta ,\lambda )}{dt}}={({\nabla}_{\theta}L(\theta ,\lambda ))}^{T}{\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda )).$  (27) 
Consider two cases:

1.
For a fixed ${\theta}_{0}\in \mathrm{\Theta}$, there exists ${\alpha}_{0}>0$ such that the update ${\theta}_{0}{\alpha {\nabla}_{\theta}L(\theta ,\lambda )}_{\theta ={\theta}_{0}}\in \mathrm{\Theta}$ for all $\alpha \in (0,{\alpha}_{0}]$. In this case, ${\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda ))={\nabla}_{\theta}L(\theta ,\lambda )$, which further implies that
$\frac{dL({\theta}_{0},\lambda )}{dt}}=\parallel {\nabla}_{\theta}L(\theta ,\lambda ){}_{\theta ={\theta}_{0}}{\parallel}^{2}\le 0,$ and this quantity is nonzero as long as $\parallel {\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda ))\parallel \ne 0$.

2.
For fixed ${\theta}_{0}\in \mathrm{\Theta}$ and any ${\alpha}_{0}>0$, there exists $\alpha \in (0,{\alpha}_{0}]$ such that ${\theta}_{\alpha}\u2254{\theta}_{0}{\alpha {\nabla}_{\theta}L(\theta ,\lambda )}_{\theta ={\theta}_{0}}\notin \mathrm{\Theta}$. The projection ${\mathrm{\Gamma}}_{\mathrm{\Theta}}({\theta}_{\alpha})={argmin}_{\theta \in \mathrm{\Theta}}\frac{1}{2}{\parallel \theta {\theta}_{\alpha}\parallel}^{2}$ maps ${\theta}_{\alpha}$ to a point in $\partial \mathrm{\Theta}$. This projection is singlevalued because of the compactness and convexity of $\mathrm{\Theta}$, and we denote the projected point by ${\overline{\theta}}_{\alpha}\in \mathrm{\Theta}$. Consider $\alpha \downarrow 0$, then
${({\nabla}_{\theta}L(\theta ,\lambda ))}^{T}{\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda ))$ $=\underset{\alpha \downarrow 0}{lim}{\displaystyle \frac{{(\theta {\theta}_{\alpha})}^{T}({\overline{\theta}}_{\alpha}\theta )}{\eta}}$ $=\underset{\alpha \downarrow 0}{lim}{\displaystyle \frac{{\parallel {\overline{\theta}}_{\alpha}\theta }^{2}}{{\eta}^{2}}}+{\displaystyle \frac{{({\overline{\theta}}_{\alpha}{\theta}_{\alpha})}^{T}({\overline{\theta}}_{\alpha}\theta )}{{\eta}^{2}}}\le 0,$ where the last inequality follows from the Projection Theorem (see Proposition 1.1.9 of Bertsekas [2009]). Again, one can verify that the timederivative quantity is nonzero as long as $\parallel {\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda ))\parallel \ne 0$.
In summary, $d{\mathcal{L}}_{\lambda}(\theta )/dt\le 0$ and this quantity is nonzero as long as $\parallel {\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda ))\parallel \ne 0$. Then by LaSalle’s Local Invariant Set Theorem (see, e.g., Theorem 3.4 of Slotine et al. [1991]), we conclude that the dynamical system tends to the largest positive invariant set within ${\mathbb{M}}_{\theta}\u2254\{\theta :\parallel {\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda ))\parallel =0\}$. Notice that ${\theta}^{*}\in {\mathbb{M}}_{\theta}$. Let $l>0$ be equal to $\mathrm{min}\{{\mathcal{L}}_{\lambda}(\theta ):\parallel {\mathrm{\Psi}}_{\mathrm{\Theta}}({\nabla}_{\theta}L(\theta ,\lambda ))\parallel =0,\theta \in {\mathbb{B}}_{r}({\theta}^{*})\setminus {\theta}^{*}\}$. Then every trajectory starting from the attraction region $$ will tend to the local minimum ${\theta}^{*}$. Since we chose ${\theta}^{*}$ to be arbitrary, this holds for all local minima. Hence, using Corollary $4$ of Chapter 2 in Borkar [2009], we conclude that if the initial policy ${\theta}^{0}$ is within the attraction region of a local minimum point ${\theta}^{*}$, then it will converge to it almost surely.
Remark.
The case in which ${\theta}^{\mathrm{*}}$ is not isolated can be handled similarly, with the minor difference that the convergence happens to a set of optimal points instead of to a single point.
Convergence of the $\lambda $update: We need to show that the assumptions of Theorem 2 in Chapter 6 of Borkar [2009] hold for the twotimescale stochastic approximation theory. Let us verify the validity of these assumptions: (i) ${\nabla}_{\lambda}L(\theta ,\lambda )$ is a Lipschitz function in $\lambda $ from Lemma 1, and (ii) stepsize rules follow from Assumption 3. (iii) Since $\lambda $ converges in a slower timescale, we have $\parallel {\theta}^{k,i}{\theta}^{*}({\lambda}^{k})\parallel \to 0$ almost surely as $i\to \mathrm{\infty}$, which, according to the Lipschitz continuity of ${\nabla}_{\lambda}L(\theta ,\lambda )$, implies that
$\parallel {\nabla}_{\lambda}L(\theta ,\lambda ){}_{\theta ={\theta}^{k,i},\lambda ={\lambda}^{k}}{\nabla}_{\lambda}L(\theta ,\lambda ){}_{\theta ={\theta}^{*}({\lambda}^{k}),\lambda ={\lambda}^{k}}\parallel \to 0\mathit{\hspace{1em}}\text{as}i\to \mathrm{\infty}.$  (28) 
Hence the $\lambda $update can be written as
${\lambda}^{k+1}={\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{k}+{\alpha}_{2}(k)\left({{\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*}({\lambda}^{k}),\lambda ={\lambda}^{k}}+{M}_{{\lambda}^{k+1}}\right)\right],$ 
where
${M}_{{\lambda}^{k+1}}={{\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*}({\lambda}^{k}),\lambda ={\lambda}^{k}}+\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{{\theta}^{*}({\lambda}^{k})}({\tau}_{j}^{k})}{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}}\delta \right).$  (29) 
From (29), we can verify that $\mathbb{E}\left[{M}_{{\lambda}^{k+1}}{\mathcal{F}}_{\lambda}^{k}\right]=0$, where ${\mathcal{F}}_{\lambda}^{k}=\sigma ({\lambda}^{m},{M}_{{\lambda}^{m}},m\le k)$ is a filtration of $\lambda $ generated by different independent trajectories. Also, we have:
$$ 
Hence, ${M}_{{\lambda}^{k+1}}$ is a Martingale difference error. Also, (v) $$. Recall that from the convergence analysis of the $\theta $update for a ${\lambda}^{k}$, we know that ${\theta}^{*}({\lambda}^{k})$ is an asymptotically stable point. Then by Theorem 2 of Chapter 6 in Borkar [2009], we can conclude that $({\theta}^{k},{\lambda}^{k})$ converges almost surely to $({\theta}^{*}({\lambda}^{*}),{\lambda}^{*})$, where ${\lambda}^{*}$ belongs to an internally chain transitive invariant set of (23).
Define the Lyapunov function:
$\mathcal{L}(\lambda )=L({\theta}^{*}(\lambda ),\lambda )+L({\theta}^{*}({\lambda}^{*}),{\lambda}^{*}),$ 
where ${\lambda}^{*}$ is a local maximum point, i.e., there exists $r$ such that for any $\lambda \in {\mathcal{B}}_{r}({\lambda}^{*})$, the Lyapunov function $\mathcal{L}(\lambda )$ is positive definite. We can follow similar lines of arguments as we did for the $\theta $update to show that $\frac{d\mathcal{L}(\lambda )}{dt}\le 0$ and this quantity is nonzero as long as ${\mathrm{\Psi}}_{\mathrm{\Lambda}}({\nabla}_{\lambda}L({\theta}^{*}(\lambda ),\lambda ))\ne 0$. Then by using the results of LaSalle’s Local Invariant Set Theorem, we can establish the convergence of the dynamical system to the largest invariant set within ${\mathbb{M}}_{\lambda}\u2254\{\lambda :{\mathrm{\Psi}}_{\mathrm{\Lambda}}({\nabla}_{\lambda}L({\theta}^{*}(\lambda ),\lambda ))=0\}$. This means that ${\lambda}^{*}\in {\mathbb{M}}_{\lambda}$ is a stationary point. Let $l=\mathrm{min}\{\mathcal{L}(\lambda ):{\mathrm{\Psi}}_{\mathrm{\Lambda}}({\nabla}_{\lambda}L({\theta}^{*}(\lambda ),\lambda ))=0,\lambda \in {\mathcal{B}}_{r}({\lambda}^{*})\setminus {\lambda}^{*}\}$. Then, every trajectory starting with ${\lambda}^{0}$ in $$ will tend to ${\lambda}^{*}$ w.p.1.
Saddle Point Analysis: By denoting ${\theta}^{*}={\theta}^{*}({\lambda}^{*})$, we want to show that $({\theta}^{*},{\lambda}^{*})$ is, in fact, a saddle point of the Lagrangian $L(\theta ,\lambda )$. Recall that, as we proved in the convergence the of $\theta $update, ${\theta}^{*}$ is a local minimum of $L(\theta ,\lambda )$ within a sufficiently small ball around itself, i.e., there exists $r>0$ such that
$L({\theta}^{*},{\lambda}^{*})\le L(\theta ,{\lambda}^{*}),\forall \theta \in \mathrm{\Theta}\cap {\mathcal{B}}_{r}({\theta}^{*}).$  (30) 
It is easy to verify that ${\theta}^{*}$ is a feasible solution of (OPTR) whenever ${\lambda}^{*}\in [0,{\lambda}_{max})$, i.e.
${D}_{KL}({\theta}^{*}\parallel \varphi )\le \delta .$  (31) 
To show this, assume for a contradiction that ${D}_{KL}({\theta}^{*}\parallel \varphi )\delta >0$. Then,
${\mathrm{\Psi}}_{\mathrm{\Lambda}}\left[{{\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*},\lambda ={\lambda}^{*}}\right]$  $=\underset{\alpha \downarrow 0}{lim}{\displaystyle \frac{{\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{*}+{\alpha {\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*},\lambda ={\lambda}^{*}}\right]{\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{*}\right]}{\alpha}}$  
$=\underset{\alpha \downarrow 0}{lim}{\displaystyle \frac{{\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{*}+\alpha \left({D}_{KL}({\theta}^{*}\parallel \varphi )\delta \right)\right]{\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{*}\right]}{\alpha}}$  
$={D}_{KL}({\theta}^{*}\parallel \varphi )\delta >0,$ 
which contradicts the fact that ${\mathrm{\Psi}}_{\mathrm{\Lambda}}\left[{{\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*},\lambda ={\lambda}^{*}}\right]=0$. Notice that the feasibility cannot be verified when ${\lambda}^{*}={\lambda}_{max}$, because ${\mathrm{\Psi}}_{\mathrm{\Lambda}}\left[{{\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*}({\lambda}_{max}),\lambda ={\lambda}_{max}}\right]=0$ when ${D}_{KL}({\theta}^{*}\parallel \varphi )>\delta $. In this case, we increase ${\lambda}_{max}$ (e.g., we set ${\lambda}_{max}\leftarrow 2{\lambda}_{max}$ in our algorithm) if such a behavior happens until it converges to an interior point of $[0,{\lambda}_{max}]$.
In addition, the complementary slackness condition
${\lambda}^{*}({D}_{KL}({\theta}^{*}\parallel \varphi )\delta )=0$  (32) 
holds. To show this, we only need to verify that $$ yields ${\lambda}^{*}=0$. For a contradiction, suppose that ${\lambda}^{*}\in (0,{\lambda}_{max})$. Then, we have
${\mathrm{\Psi}}_{\mathrm{\Lambda}}\left[{{\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*}({\lambda}^{*}),\lambda ={\lambda}^{*}}\right]$  $$ 
which contradicts the fact that ${\mathrm{\Psi}}_{\mathrm{\Lambda}}\left[{{\nabla}_{\lambda}L(\theta ,\lambda )}_{\theta ={\theta}^{*},\lambda ={\lambda}^{*}}\right]=0$, meaning that ${\lambda}^{*}=0$ in this case. Hence, we have:
$L({\theta}^{*},{\lambda}^{*})$  $={V}_{{\theta}^{*}}({x}_{0})+{\lambda}^{*}\left({D}_{KL}({\theta}^{*}\parallel \varphi )\delta \right)$  
$={V}_{{\theta}^{*}}({x}_{0})$  
$\ge {V}_{{\theta}^{*}}({x}_{0})+\lambda \left({D}_{KL}({\theta}^{*}\parallel \varphi )\delta \right)=L({\theta}^{*},\lambda ).$  (33) 
From (30) and (33), we observe that $({\theta}^{*},{\lambda}^{*})$ is a saddle point of $L(\theta ,\lambda )$, so according to the saddle point theorem, ${\theta}^{*}$ is a local minimum of (OPTR). Recall that the result of Theorem 1 depends on the initial values for ${\theta}^{0}$ and ${\lambda}^{0}$, so the convergence to a local minimum is sample path depenedant. ∎
B.2 Proof of Corollary 1
Proof.
From the convergence analysis of the $\theta $update, we know that $\{{\theta}^{k}\}$ converges almost surely to the largest invariant set within ${\mathbb{M}}_{\theta}$, and similarly, $\{{\lambda}^{k}\}$ converges almost surely to the largest invariant set within ${\mathbb{M}}_{\lambda}$. We also know from (31) that ${\theta}^{*}$ is a feasible point of (OPTR). When ${\lambda}^{*}=0$, then $L({\theta}^{*},{\lambda}^{*})={V}_{{\theta}^{*}}({x}_{0})$. Also, for ${\lambda}^{*}>0$, the complementary slackness condition (32) implies ${D}_{KL}({\theta}^{*}\parallel \varphi )=\delta $. Hence ${{\nabla}_{\theta}{D}_{KL}(\theta \parallel \varphi )}_{\theta ={\theta}^{*}}=0$, which in turn, means that
${{\nabla}_{\theta}L(\theta ,{\lambda}^{*})}_{\theta ={\theta}^{*}}={{\nabla}_{\theta}{V}_{\theta}({x}_{0})}_{\theta ={\theta}^{*}}+{{\lambda}^{*}{\nabla}_{\theta}{D}_{KL}(\theta \parallel \varphi )}_{\theta ={\theta}^{*}}={{\nabla}_{\theta}{V}_{\theta}({x}_{0})}_{\theta ={\theta}^{*}}.$  (34) 
Hence, for a ${\theta}^{*}$ located in the interior of $\mathrm{\Theta}$, we have ${{\nabla}_{\theta}L(\theta ,{\lambda}^{*})}_{\theta ={\theta}^{*}}={{\nabla}_{\theta}{V}_{\theta}({x}_{0})}_{\theta ={\theta}^{*}}=0$, so it is a firstorder stationary point of (OPTR). However, if ${\theta}^{*}\in \partial \mathrm{\Theta}$, it is possible to have $\parallel {\nabla}_{\theta}L(\theta ,{\lambda}^{*}){}_{\theta ={\theta}^{*}}\parallel \ne 0$. ∎
Remark.
In practice, we choose the projection set $\mathrm{\Theta}$ large enough so that the latter case (convergence to boundary) will not happen. For example, assuring that the weights of a neural network do not diverge is a sufficient criterion to use instead of the projection operator ${\mathrm{\Gamma}}_{\mathrm{\Theta}}$.
B.3 Equivalent Results for (OPTF)
A similar PDPG algorithm to the one proposed in Algorithm A.2 can solve (OPTF), only requiring a slight modification of rules (15) and (16) as
${\theta}^{k+1}$  $={\mathrm{\Gamma}}_{\mathrm{\Theta}}\left[{\theta}^{k}{\alpha}_{1}(k)\left({{\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}{\nabla}_{\theta}\mathrm{log}{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}_{\theta ={\theta}^{k}}\left(J({\tau}_{j}^{k})+{\lambda}^{k}IS({\tau}_{j}^{k})\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}{{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}}{\lambda}^{k}\right)\right)\right]$  
${\lambda}^{k+1}$  $={\mathrm{\Gamma}}_{\mathrm{\Lambda}}\left[{\lambda}^{k}+{\alpha}_{2}(k)\left({\displaystyle \frac{1}{N}}{\displaystyle \sum _{j=1}^{N}}IS({\tau}_{j}^{k})\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\varphi}({\tau}_{j}^{k})}{{\mathbb{P}}_{\theta}({\tau}_{j}^{k})}}\delta \right)\right],$ 
where $IS({\tau}_{j}^{k})={\mathbb{P}}_{\varphi}({\tau}_{j}^{k})/{\mathbb{P}}_{\theta}({\tau}_{j}^{k})$ is the importance sampling weight added to account for the bias introduced by sampling under the student’s policy. To ensure a welldefined (OPTF), we need the following assumption:
Assumption 4.
Welldefined (OPTF): for any state–action pair $\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{\in}\mathrm{X}\mathrm{\times}\mathrm{A}$ with ${\mathrm{\pi}}_{\mathrm{S}}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{=}\mathrm{0}$, we have ${\mathrm{\pi}}_{\mathrm{T}}\mathrm{}\mathrm{(}\mathrm{x}\mathrm{,}\mathrm{a}\mathrm{)}\mathrm{=}\mathrm{0}$.
This assumption ensures a similar criterion to that of Assumption 1, but notice that in this case, the student might take any action, regardless of the teacher’s policy. Exactly the same steps can be taken, virtually verbatim, to prove the following convergence property of the PDPG algorithm for (OPTF).
Theorem 2.
Under Assumptions 2, 3, and 4, the sequence of policy updates (starting from ${\theta}^{\mathrm{0}}$ sufficiently close to a local optimum point ${\theta}^{\mathrm{*}}$) and Lagrange multipliers converges almost surely to a saddle point of the Lagrangian, i.e., $\mathrm{(}\theta \mathrm{(}k\mathrm{)}\mathrm{,}\lambda \mathrm{(}k\mathrm{)}\mathrm{)}\stackrel{a\mathrm{.}s\mathrm{.}}{\mathrm{\u27f6}}\mathrm{(}{\theta}^{\mathrm{*}}\mathrm{,}{\lambda}^{\mathrm{*}}$). Then, ${\theta}^{\mathrm{*}}$ is the local optimal solution of (OPTF).
Appendix C Practical PDPG Algorithm
A naive implementation of Algorithm A.2 would result in a highvariance training procedure. In this section, we discuss several techniques for variance reduction, resulting in a more stable algorithm compared to the one proposed in Algorithm A.2.
C.1 Stepwise KLdivergence Measure
In the policy distillation literature, some studies use a trajectorywise KLdivergence (KLF) as the distance metric Teh et al. [2017], but the stepwise KLdivergence between the distribution is also common Ghosh et al. [2017], which is defined as:
${D}_{KL}^{step}(\varphi \parallel \theta )={\mathbb{E}}_{x\sim {d}_{{\pi}_{T}}}\left[{D}_{KL}({\pi}_{T}(\cdot x;\varphi )\parallel {\pi}_{S}(\cdot x;\theta ))\right],$  (35) 
where
${D}_{KL}({\pi}_{T}(\cdot x;\varphi )\parallel {\pi}_{S}(\cdot x;\theta ))={\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{T}(ax;\varphi )\mathrm{log}{\displaystyle \frac{{\pi}_{T}(ax;\varphi )}{{\pi}_{S}(ax;\theta )}}.$  (36) 
In the next proposition, we explore the relations between these two methods.
Proposition 1.
The following relation holds between the trajectorywise and stepwise KLdivergence metrics:
${D}_{KL}(\varphi \parallel \theta )\le \mathbb{E}[H]{D}_{KL}^{step}(\varphi \parallel \theta )$  (37) 
Proof.
According to the definition of trajectorywise KLdivergence, we have:
${D}_{KL}({\mathbb{P}}_{\varphi}(\tau ){\mathbb{P}}_{\theta}(\tau ))$  $={\displaystyle \sum _{\tau}}{\mathbb{P}}_{\varphi}(\tau )\mathrm{log}{\displaystyle \frac{{\mathbb{P}}_{\varphi}(\tau )}{{\mathbb{P}}_{\theta}(\tau )}}$  
$={\displaystyle \sum _{\tau}}{\mathbb{P}}_{\varphi}(\tau )\mathrm{log}{\displaystyle \frac{\mu ({x}_{0}){\prod}_{t=0}^{H1}{\pi}_{T}({a}_{t}{x}_{t};\varphi )P({x}_{t+1}{x}_{t},{a}_{t})}{\mu ({x}_{0}){\prod}_{t=0}^{H1}{\pi}_{S}({a}_{t}{x}_{t};\theta )P({x}_{t+1}{x}_{t},{a}_{t})}}$  
$={\displaystyle \sum _{\tau}}{\mathbb{P}}_{\varphi}(\tau ){\displaystyle \sum _{t=0}^{H1}}\mathrm{log}{\displaystyle \frac{{\pi}_{T}({a}_{t}{x}_{t};\varphi )}{{\pi}_{S}({a}_{t}{x}_{t};\theta )}}$  
$={\displaystyle \sum _{x\in \mathcal{X},a\in \mathcal{A}}}{\displaystyle \sum _{\tau}}{\mathbb{P}}_{\varphi}(\tau ){\displaystyle \sum _{t=0}^{H1}}{\mathbb{I}}_{t}(\tau ;x,a)\mathrm{log}{\displaystyle \frac{{\pi}_{T}(ax;\varphi )}{{\pi}_{S}(ax;\theta )}}$  
$\le {\displaystyle \sum _{x\in \mathcal{X},a\in \mathcal{A}}}\mathbb{E}[H]{d}_{{\pi}_{T}}(x){\pi}_{T}(ax;\varphi )\mathrm{log}{\displaystyle \frac{{\pi}_{T}(ax;\varphi )}{{\pi}_{S}(ax;\theta )}}$  
$=\mathbb{E}[H]{\displaystyle \sum _{x\in \mathcal{X}}}{d}_{{\pi}_{T}}(x){\displaystyle \sum _{a\in \mathcal{A}}}{\pi}_{T}(ax;\varphi )\mathrm{log}{\displaystyle \frac{{\pi}_{T}(ax;\varphi )}{{\pi}_{S}(ax;\theta )}}$  
$=\mathbb{E}[H]{\mathbb{E}}_{x\sim {d}_{{\pi}_{T}}}\left[{D}_{KL}({\pi}_{T}(\cdot x;\varphi ){\pi}_{S}(\cdot x;\theta ))\right]$ 
Here, ${\mathbb{I}}_{t}(\tau ;x,a)$ is the indicator of whether $({x}_{t}=x,{a}_{t}=a)$ occurs along trajectory $\tau $. Also, ${d}_{{\pi}_{T}}(x)$ is the distribution of being in state $x$ under policy ${\pi}_{T}$, defined as
${d}_{{\pi}_{T}}(x)={\displaystyle \sum _{t=0}^{{H}_{max}}}{d}_{t,{\pi}_{T}(x)}/{H}_{max},$ 
and ${d}_{t,{\pi}_{T}(x)}$ is the probability of being in $x$ at time $t$ under policy ${\pi}_{T}$. ∎
According to this proposition, the stepwise KL distances can be used to provide an upper bound on the trajectorywise one. In other words, if the stepwise KL multiplied by the expected horizon length are less than $\delta $, then it is also correct for the trajectorywise one.