Don't Forget Your Teacher: A Corrective Reinforcement Learning Framework

  • 2019-05-30 01:47:18
  • Mohammadreza Nazari, Majid Jahani, Lawrence V. Snyder, Martin Takáč
  • 4


Although reinforcement learning (RL) can provide reliable solutions in manysettings, practitioners are often wary of the discrepancies between the RLsolution and their status quo procedures. Therefore, they may be reluctant toadapt to the novel way of executing tasks proposed by RL. On the other hand,many real-world problems require relatively small adjustments from the statusquo policies to achieve improved performance. Therefore, we propose astudent-teacher RL mechanism in which the RL (the "student") learns to maximizeits reward, subject to a constraint that bounds the difference between the RLpolicy and the "teacher" policy. The teacher can be another RL policy (e.g.,trained under a slightly different setting), the status quo policy, or anyother exogenous policy. We formulate this problem using a stochasticoptimization model and solve it using a primal-dual policy gradient algorithm.We prove that the policy is asymptotically optimal. However, a naiveimplementation suffers from high variance and convergence to a stochasticoptimal policy. With a few practical adjustments to address these issues, ournumerical experiments confirm the effectiveness of our proposed method inmultiple GridWorld scenarios.


Quick Read (beta)

Don’t Forget Your Teacher:
A Corrective Reinforcement Learning Framework

Mohammadreza Nazari
[email protected]
\AndMajid Jahani
[email protected]
\ANDLawrence V. Snyder
[email protected]
\AndMartin Takáč
[email protected] \AND Department of Industrial and Systems Engineering
Lehigh University, Bethlehem, PA 18015

Although reinforcement learning (RL) can provide reliable solutions in many settings, practitioners are often wary of the discrepancies between the RL solution and their status quo procedures. Therefore, they may be reluctant to adapt to the novel way of executing tasks proposed by RL. On the other hand, many real-world problems require relatively small adjustments from the status quo policies to achieve improved performance. Therefore, we propose a student-teacher RL mechanism in which the RL (the “student”) learns to maximize its reward, subject to a constraint that bounds the difference between the RL policy and the “teacher” policy. The teacher can be another RL policy (e.g., trained under a slightly different setting), the status quo policy, or any other exogenous policy. We formulate this problem using a stochastic optimization model and solve it using a primal-dual policy gradient algorithm. We prove that the policy is asymptotically optimal. However, a naive implementation suffers from high variance and convergence to a stochastic optimal policy. With a few practical adjustments to address these issues, our numerical experiments confirm the effectiveness of our proposed method in multiple GridWorld scenarios.


shapes,arrows, decorations.markings \definecolormyGreenRGB0,175,0


Don’t Forget Your Teacher:
A Corrective Reinforcement Learning Framework

  Mohammadreza Nazari [email protected] Majid Jahani [email protected] Lawrence V. Snyder [email protected] Martin Takáč [email protected] Department of Industrial and Systems Engineering Lehigh University, Bethlehem, PA 18015


noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

We encourage using a new paradigm called corrective RL, in which a reinforcement learning (RL) agent is trained to maximize its reward while not straying “too far” from a previously defined policy. The motivation is twofold: (1) to provide a gentler transition for a decision-maker who is accustomed to using a certain policy but now considers implementing RL, and (2) to develop a framework for gently transitioning from one RL solution to another when the underlying environment has changed.

RL has recently achieved considerable success in artificially created environments, such as Atari games Mnih et al. (2013, 2016) or robotic simulators Lillicrap et al. (2015). Exploiting the power of neural networks in RL algorithms has been shown to exhibit super-human performance by enabling automatic feature extraction and policy representations, but real-world applications are still very limited, conceivably due to lack of representativity of the optimized policies. Over the past few years, a major portion of the RL literature has been developed for RL agents with no prior information about how to do a task. Typically, these algorithms start with random actions and learn while interacting with the environment through trial and error. However, in many practical settings, prior information about good solutions is available, whether from a previous RL algorithm or a human decision-maker’s prior experience. Our approach trains the RL agent to make use of this prior information when optimizing, in order to avoid deviating too far from a target policy.

Although toy environments and Atari games are prevalent in the RL literature due to their simplicity, RL has recently been trying to find its path to real-world applications such as recommender systems Chen et al. (2018), transportation Nazari et al. (2018), Internet of Things Feng et al. (2017), supply chain Gijsbrechts et al. (2018); Oroojlooyjadid et al. (2017) and various control tasks in robotics Gu et al. (2017). In all of these applications, there is a crucial risk that the new policy might not operate logically or safely, as one was expecting it to do. A policy that attains a large reward but deviates too much from a known policy—which follows logical steps and processes—is not desirable for these tasks. For example, users of a system who were accustomed to the old way of doing things would likely find it hard to switch to a newly discovered policy, especially if the benefit of the new policy is not obvious or immediately forthcoming. Indeed, we argue that many real-world tasks only need a small corrective fix to the currently running policies to achieve their desired goals, instead of designing everything from scratch. Throughout this paper, we adhere to this paradigm—we call it “corrective RL”—which utilizes an acceptable policy as a gauge when designing novel policies. We consider two agents, namely a teacher and a student. Our main question is how the student can improve upon the teacher’s policy while not deviating too far from it. More formally, we would like to train a student in a way that it maximizes a long-term RL objective while keeping its own policy close to that of the teacher.

For example, consider an airplane that is controlled by an autopilot that follows the shortest haversine path policy towards the destination. Then, some turbulence occurs, and we want to modify the current path to avoid the turbulence. A “pure” RL algorithm would re-optimize the trajectory from scratch, potentially deviating too far from the optimal path in order to avoid the turbulence. Corrective RL would ensure that the adjustments to the current policy are small, so that the flight follows a similar path and has a similar estimated time of arrival, while ensuring that the passengers experience a more comfortable (less turbulent) flight. Another example is in predictive maintenance, where devices are periodically inspected for possible failures. Inspection schedules are usually prescribed by the device designers, but many environmental conditions affect failure rates, hence there is no guarantee that factory schedules are perfect. If the objective is to reduce downtime with only slight adjustments to the current schedules, conventional RL algorithms would have a hard time finding such policies.

Similar concerns arise in other business and engineering domains as well, including supply chain management, queuing systems, finance, and robotics. For example, an enormous number of exact and inexact methods have been proposed for classical inventory management problems under some assumptions on the demand, e.g., that it follows a normal distribution Snyder and Shen (2019). Once we add more stochasticity to the demand distribution or consider more complicated cost functions, these problems often become intractable using classical methods. Of course, vanilla RL can provide a mechanism for solving more complex inventory optimization problems, but practitioners may prefer policies that are simple to implement and intuitive to understand. Corrective RL can help steer the policy toward the preferred ones, while still maintaining near-optimal performance. Given these examples, one can interpret our approach as an improvement on black-box heuristics, which uses a data-driven approach to improve the performance of these algorithms without dramatic reformulations.

The contributions of this work are as follows: i) we introduce a new paradigm for RL tasks, convenient for many real-world tasks, that improves upon the currently running system’s policy with a small perturbation of the policy; ii) we formulate this problem using an stochastic optimization problem and propose a primal–dual policy gradient algorithm which we prove to be asymptotically optimal, and iii) using practical adjustments, we illustrate how an RL framework can act as an improvement heuristic. We show the effectiveness and properties of the algorithm in multiple GridWorld motion planning experiments.

2 Problem Definition

We consider the standard definition of a Markov decision process (MDP) using a tuple (𝒳,𝒜,C,P,P0). In our notation, 𝒳𝒳{xterm}={1,2,,n,xterm} is the state space, where 𝒳 is the set of transient states and xterm is the terminal state; 𝒜 is the set of actions; C:𝒳×𝒜[0,Cmax] is the cost function; P is the transition probability distribution; and P0 is the distribution of the initial state x0. At each time step t=0,1,, the agent observes xt𝒳, selects at𝒜, and incurs a cost ct=C(xt,at). Selecting the action at at state xt transitions the agent to the next state xt+1P(|xt,at).

Consider two agents, a teacher and a student. The teacher has prior knowledge about the task and prescribes an action for any state that the student encounters, and the student has the authority to follow the teacher’s action or act independently based on its own learned policy. Let πS denote the policy of the student and πT be the policy of the teacher, where both πS and πT are stationary stochastic policies defined as a mapping from a state–action pair to a probability distribution, i.e., πi:𝒳×𝒜[0,1], i{S,T}. For example, πS(a|x) denotes the probability of choosing action a in state x by the student. In policy gradient methods, the policies are represented with a function approximator, usually modeled by a neural network, where we denote by θN and ϕN the corresponding policy weights of the student and teacher, respectively; the teacher and student parameterized policies are denoted by πT(|;ϕ) and πS(|;θ). In what follows, we adapt this parameterization structure into the notation and interchangeably refer to πS and πT with their associated weights, θ and ϕ.

We consider the simulation optimization setting, where we can sample from the underlying MDP and observe the costs. Consider a possible state–action–cost trajectory τ{x0,a0,c0,x1,a1,c1,,xH-1,aH-1,cH-1,xH} and let 𝒯{τ} to be the set of all possible trajectories under all admissible policies. For simplicity of exposition, we assume that the first hitting time H of a terminal state xterm from any given x and following a stationary policy is bounded almost surely with an upper bound Hmax, i.e., HHmax almost surely. Since the sample trajectories in many RL tasks terminate in finite time, this assumption is not restrictive. For example, the game fails after reaching a certain state or a time-out signal may terminate the trajectory. Along a trajectory τ, the system incurs a discounted cost Jθ(τ)=t=0H-1γtct, with discount factor γ(0,1], and the probability of sampling such a trajectory is θ(τ)=P0(x0)t=0H-1πS(at|xt;θ)P(xt+1|xt,at). We denote the expected cost from state x onward until hitting the terminal state xterm by Vθ(x), i.e.,

Vθ(x)=𝔼θ[t=0H-1γtC(xt,at)|x0=x]. (1)

2.1 Distance Measure

An important question that arises is how to quantify the distance between the policies of the teacher and the student. There are several distance measures studied in the literature for computing the closeness of two probability measures. Among those, the Kullback-Leibler (KL) divergence Nasrabadi (2007) is a widely used metric. In this work, we consider both reverse (KL-R) and forward (KL-F) KL-divergence, defined as

DKL(θϕ)DKL (θ(τ)ϕ(τ))=τ𝒯θ(τ)logθ(τ)ϕ(τ) (KL-R)
DKL(ϕθ)DKL (ϕ(τ)θ(τ))=τ𝒯ϕ(τ)logϕ(τ)θ(τ). (KL-F)

KL-divergence is known to be an asymmetric distance measure, meaning that changing the order of the student and teacher distributions will cause different learning behaviors. We will use the reverse KL-divergence in the theoretical analysis since it provides more compact notations. However, in all of the experiments, we will consider the forward setting, i.e., DKL(ϕθ), unless otherwise specified. Informally speaking, this form of KL-divergence, which is also known as the mean-seeking KL, allows the student to perform actions that are not included in the teacher’s behavior. This is because i) when the teacher can perform an action at in a given state st, the student should also have πS(at|st)>0 to keep the distance finite, and ii) the student can have πS(at|st)>0 irrespective of whether the teacher is doing that action or not. The reverse direction, DKL(θϕ), known as the mode-seeking KL, can be useful as well. For example, let’s assume that the teacher policy is a mixture of several Gaussian sub-policies. Using the reverse order will allow the student to assign only one sub-policy as its decision making policy. Hence, the choice of reverse KL would be preferred if the student wants to find a policy which is close to a teacher’s sub-policy with the highest return. The justification of this behavior is also visible from the definition: when πS(at|st)>0 for a given state and action, then the teacher also should have πT(at|st)>0. Also, πT(at|st)=0 would not allow πS(at|st)>0. For more detailed discussion and examples, we refer the interested reader to Appendix C.1 and Section 10.1.2 of Nasrabadi (2007).

2.2 Optimization Problems

The student’s optimization problems that we would like to solve for reverse KL-divergence (OPT-R) and the forward KL-divergence (OPT-F) are defined as

minθΘVθ(x0)s.t.DKL(θϕ)δ (OPT-R)
minθΘVθ(x0)s.t.DKL(ϕθ)δ, (OPT-F)

where δ is an upper bound on the KL-divergence and Θ is a convex compact set of possible policy parameters. Most of the theoretical analysis of the two optimization problems is quite similar, so we will use (OPT-R) as our main formulation. In Appendix B.3, we investigate the equivalence of both problems and state their minor differences.

The widely adopted problem studied for MDPs only contains the objective function; however, we impose an additional constraint to restrict the student’s policy. By fixing an appropriate value for δ, one can enforce a constraint on the maximum allowed deviation of the student policy from that of the teacher. The objective is to find a set of optimal points θ* that minimizes the discounted expected cost while not violating the KL constraint. Notice that πS=πT is a trivial feasible solution. In addition, we need to have the following assumption to ensure that (OPT-R) is well-defined:

Assumption 1.

(Well-defined (OPT-R)) For any state–action pair (x,a)X×A with πT(x,a)=0, we have πS(x,a)=0.

Intuitively, Assumption 1 specifies that when the teacher does not take a specific action in a given state, the student also cannot choose that action. Even though this assumption might seem restrictive, it is valid in situations in which the student is indeed limited to the positive-probability action space of the teacher. Alternatively, we can certify this assumption by adding a small noise term to the outcome of the teacher’s policy at the expense of some information loss.

2.3 Lagrangian Relaxation of (OPT-R)

The standard method for solving (OPT-R) is by applying Lagrangian relaxation Bertsekas (1999). We define the Lagrangian function

L(θ,λ)Vθ(x0)+λ(DKL(θϕ)-δ), (2)

where λ is the Lagrange multiplier. Then the optimization problem (OPT-R) can be converted into the following problem:

maxλ0minθΘL(θ,λ). (3)

The intuition beyond (3) is that we now allow the student to deviate arbitrarily much from the teacher’s policy in order to decrease the cumulative cost, but we penalize any such deviation.

Next, we define a dynamical system which, as we will prove in Appendix B, solves problem (OPT-R) under several common assumptions for stochastic approximation methods. Once we know the optimal Lagrange multiplier λ*, then the student’s optimal policy is

θ*argminθVθ(x0)+λ*(DKL(θϕ)-δ). (4)

A point (θ*,λ*) is a saddle point of L(θ,λ) if for some r>0, we have

L(θ*,λ)L(θ*,λ*)L(θ,λ*) (5)

for all θΘ𝔹r(θ*) and λ0, where 𝔹r(θ*) represents a ball around θ* with radius r. Then, the saddle point theorem Bertsekas (1999) immediately implies that θ* is the local optimal solution of (OPT-R).

3 Primal–Dual Policy Gradient Algorithm

We propose a primal–dual policy gradient (PDPG) algorithm for solving (OPT-R). Due to space limitation, we leave the detailed algorithm to Appendix A.2, but the overall scheme is as follows. After initializing the student’s policy parameters, possibly with those of the teacher, we sample multiple trajectories under the student’s policy at each iteration k. Then the sampled trajectories are used to calculate the approximate gradient of the Lagrangian function with respect to θ and λ. Finally, using an optimization algorithm, we update θ and λ according to the approximated gradients.

In order to prove our main convergence result, we need some technical assumptions on the student’s policy and step sizes.

Assumption 2.

(Smooth policy) For any (x,a)X×A, πS(a|x;θ) is a continuously differentiable function in θ and its gradient is L-Lipschitz continuous, i.e., for any θ1 and θ2,

θπS(a|x;θ)|θ=θ1-θπS(a|x;θ)|θ=θ2θ1-θ2. (6)
Assumption 3.

(Step-size rules) The step-sizes α1(k) and α2(k) in update rules (15) and (16) satisfy the following relations:

  1. 1.


  2. 2.


  3. 3.


Relations 1 and 2 in Assumption 3 are common in stochastic approximation algorithms, and 3 indicates that the Lagrange multiplier update is in a slower time-scale compared to the policy updates. The latter condition simplifies the convergence proof by allowing us to study the PDPG as a two-time-scale stochastic approximation algorithm. The following theorem states the main theoretical result of this paper.

Theorem 1.

Under Assumptions 1, 2, and 3, the sequence of policy updates (starting from θ0 sufficiently close to a local optimum point θ*) and Lagrange multipliers converges almost surely to a saddle point of the Lagrangian, i.e., (θ(k),λ(k))a.s.(θ*,λ*). Moreover, θ* is a local optimal solution of (OPT-R).


(sketch) The proof is similar to those found in Tamar et al. (2012); Chow et al. (2017). It is based on representing θ and λ update rules with a two-time-scale stochastic approximation algorithm. For each timescale, the algorithm can be shown to converge to the stationary points of the corresponding continuous-time system. Finally, it can be shown that the fixed point is, in fact, a locally optimal point. In Appendix B.1, we provide a formal proof of this theorem. ∎

Corollary 1.

Under Assumptions 1, 2, and 3, the sequence of policy updates and Lagrange multipliers converges globally to a stationary point of the Lagrangian almost surely. Moreover, if θ* is in the interior of Θ, then θ* is a feasible first order stationary point of (OPT-R), i.e., θVθ(x0)|θ=θ*=0 and DKL(θ*ϕ)δ.

Theorem 1 and Corollary 1 are also valid for the forward KL constraint case, as we discuss in Appendix B.3.

4 Practical PDPG Algorithm

Although the algorithm presented in the previous section is proved to converge to a first-order stationary point, it cannot directly serve as a practical learner algorithm. The main reason is that it produces a high-variance approximation of the gradients, which would lead to unstable learning. In this section, we propose several approximations to the theoretically-justified PDPG in order to develop a more practical algorithm. For this algorithm, we will consider the forward definition of KL-divergence due to the mean-covering property.

One source of variance is the reward bias, which can be handled by adding a critic, similar to Konda and Tsitsiklis (2000). Our next adjustment is to use an approximation of the step-wise KL-divergence, defined as

D^KLstep(ϕθ)=1Nj=1Nt=0HDKLstep(πT(|xt;ϕ)πS(|xt;θ)) (7)

where DKLstep(πT(|xt;ϕ)πS(|xt;θ))=a𝒜πT(a|xt;ϕ)logπT(a|xt;ϕ)πS(a|xt;θ). As we discuss in Appendix C.1, using (7) results in a much smaller variance, while still ensuring the convergence results. Intuitively, this equation suggests that instead of computing the trajectory probabilities and then computing the KL-divergence, as in (KL-R), one can compute the KL in every visited state along a trajectory and sum them up. In addition to this change, we will further normalize each DKLstep by its trajectory length H to remove the effect of the variable horizon length. The latter modification will lead to more sensible KL values and will make the choice of δ easier.

A second difficulty with the algorithm in Section 3 is that, unlike conventional policy gradient algorithms, there is no guarantee that the student’s optimal policy is a deterministic one. In fact, in most of our experiments, it happens that the optimal policy is stochastic, especially when the teacher’s policy itself is stochastic. To illustrate this, consider two scenarios: i) The student refuses to do the suggested action of a deterministic teacher. In this case, she would incur an infinite cost as a result of her disobedience, so the problem will be infeasible. ii) The teacher is less informative and has no clue about most of the state space, so often takes random actions. Trying to emulate this teacher would cause degraded performance for the student as well, so the student would also take many less informed actions.

A stochastic optimal policy is usually not desirable since it poses major safety and reliability challenges, so our next adjustments are an attempt to address this issue. One possible mitigation for the first scenario might be using a bounded distance measure such as Hellinger Cramer (1946) instead of KL-divergence, but our numerical experiments did not confirm that this is effective. We observe that by using the Hellinger constraint, the total entropy of the student’s policy stays high, without any improvement in the student’s policy. Instead, we propose using percentile KL-clipping, which is defined as

clipρ(DKLstep)=max{ρ% percentile of all DKLsteps at time t,DKLstep} (8)

In fact, the clip function enables the student to totally disagree with the teacher in ρ% of the visited states, without receiving an extremely large penalty. Selecting the values for ρ depends on our perception about how perfect the teacher is. Setting ρ close to 100 means that we believe in the teacher’s suggestions. As we decrease ρ, we rely less on the teacher and can disobey more freely.

The last major modification is to control the expected entropy at a certain small level δent>0, i.e.,

ent(θ)-𝔼x[1Ht=0Ha𝒜πS(a|x;θ)logπS(a|x;θ)]=δent. (9)

The justification for adding (9) is that we would like the optimal policy to be close to a deterministic one as much as possible. By setting a small value for δent, we can enforce this property. Also, this constraint tries to avoid having a deterministic policy in intermediate training steps, in order to allow more exploration. To add this constraint, we use the same Lagrangian technique, adding an extra term to the Lagrangian function:

L(θ,λ,ζ)Vθ(x0)+λ(DKL(ϕθ)-δ)+ζ(ent(θ)-δent). (10)

All of these modifications, along with a few others, are summarized in Algorithm LABEL:alg:ppg of Appendix LABEL:app:ppg.

5 Experiments

We illustrate the efficiency of the proposed methods with multiple GridWorld experiments. In the first set of experiments, the teacher tries to teach the student to perform an oscillating maneuver around the walls. In the second set, we study how the student can comprehend changes in the environment change and utilize them to increase its rewards.

5.1 Square-Wave Teacher

In this experiment, we consider a teacher who gives a suggestion in every state of a GridWorld. We study two variants of the teacher, one who is very determined about all of his suggestions and the other who is less confident. Figure 1 illustrates the environment and both teachers’ suggestions. A student wants to find a path from the blue state to the green target. Each step has a reward -1 and reaching the target brings +100 reward. If the student wants to act independently, the optimal path is a trivial horizontal line. However, our objective is to force the student to “listen” to her teacher up to some level.

(a) “Determined” teacher with the corresponding suggested actions for every state. The red line shows the teacher’s suggested path to target
(b) “Less confident” teacher with deterministic actions in a subset of states and uniformly random actions in the rest
(c) Optimal Path (in green) versus a sample path found by PDPG (in purple) with δ=0.2 and ρ=8
Figure 1: Two different teachers with suggested actions and optimal path.

Determined Teacher: In this part, a teacher has a preferred action in every state with a probability of around 98%. As we observe in Figure 0(a), these suggestions might help the student in reaching the target, but they are inefficient. For instance, if the student follows a square-wave sequence of actions as illustrated by the red line, she will be able to reach the target while following all of the teacher’s suggestions exactly. Our objective is to allow the student deviate from the teacher for a few steps, to find shorter routes.

By using PDPG, the student is able to find policies that are a mixture of the horizontal path and the square-wave route. For example, in Figure 0(c), we have illustrated an instance of the student’s optimized path with δ=0.2 . The extent to which either policy is followed depends on values of δ and the KL-clipping parameter ρ. Figure 1(a) illustrates the student’s total reward for different δ quantities without KL-clipping. We observe that as we increase δ, we allow the student to act more freely, hence she gets a higher reward. However, after 5000 training iterations, the reward remains at the same level with too much oscillation. Recalling the discussions of Section 4, this behavior indicates convergence to a stochastic policy.

To reduce the oscillating behavior, we proposed adding an entropy constraint and KL-clipping. Figure 1(b) shows how adding the entropy constraint results in a more deterministic (i.e., lower entropy) policy. Also, in Figure 1(c), we have added KL-clipping. As we decrease ρ, the student can totally disagree with the teacher in a larger proportion of the visited states, so she can find better policies with higher rewards. For different values of ρ, we see that the policy can converge to either a stochastic or deterministic one. For ρ=70,75, it converges to a deterministic horizontal line policy. With ρ=80,85, it learns to deterministically follow one -shaped path followed by a horizontal route, and for ρ=95, it follows a -shaped path with a horizontal line at the end. Notice that even with ρ=100, which means no clipping, the student is not exactly following the teacher. We also observe that for ρ=90, it fails to converge to a deterministic policy. One justification for such a failure is that the student’s policy is far better than the less-rewarding deterministic one, but not good enough to get to the next level of performance. Finally, Figure 1(d) shows how the λ and ζ values converge to their optimal values.

Less Confident Teacher: This experiment is designed to illustrate how a less confident teacher can still teach the student to follow some of his suggestions, but it will yield a lower level of confidence of the student. Figure 0(b) shows the suggested actions of the teacher; he is deterministic only in a subset of the states. For the rest, he does not have any information, so he suggests actions uniformly at random. The less confident teacher still has the square wave as the general idea (which is bad, just like the determined teacher), but also has extra randomness that points the student in even worse directions. In other words, the less confident teacher has a worse policy overall than the determined one. Recommending random actions causes the student to have more volatile behavior. We can observe this fact by comparing Figure 2(a) with 1(a), where the student’s converged policy produces a wider range of rewards for the less-confident teacher’s case. Also, the average reward for this case is slightly lower, which can be explained by the inadequate information that the less-confident teacher provides for solving the task.

Figure 2(b) shows that adding KL-clipping helps in reducing the volatility, but one needs to choose a much smaller value for ρ (compare it with Figure 1(c)). Yet, even a small ρ does not necessarily result in a deterministic policy; for ρ as small as 0.4, the student has converged to a stochastic one.

(a) The effect of δ on reward; no KL-clipping
(b) The effect of entropy constraint; δ=0.2
(c) Total reward for different ρ and δ=0.2
(d) Convergence of λ and ζ; δ=0.2
Figure 2: Performance of a student learning from the deterministic teacher
Figure 3: Performance of a student learning from the less confident teacher
Figure 4: Teacher and student environments as well as their policies

5.2 Wall Leaping

The purpose of this experiment is to show that PDPG can act as an improvement method when the student encounters a slightly modified environment. The teacher’s reward structure is similar to the structure in the previous experiment, i.e., -1 for every step and +100 for reaching the target. However, the student comprehends that she can leap over some of the walls with a reward of -2. We use a vanilla policy gradient algorithm to train the teacher, which provides paths like the one illustrated in Figure 3(a). If we allow the student to learn without any constraint, it will find the green path in Figure 4 with a KL-divergence of 0.89. However, this is not what we are looking for since it is extremely different from the teacher. Instead, we use the PDPG algorithm to constrain the policy deviation with δ=0.3. Using this parameter, the student learns to follow the purple path, with a KL-divergence of 0.23.

6 Related Work

Learning from a teacher is a well-studied problem in the literature on supervised learning Girshick et al. (2014) and imitation learning Schaal (1999); Thomaz et al. (2006). However, we are not aware of any work using a teacher to control specific behaviors of a student. The typical use case of a student–teacher framework in RL is in “policy compression,” where the objective is to train a student from a collection of well-trained RL policies. Policy distillation Rusu et al. (2015) and actor–mimic Parisotto et al. (2015) are two methods that distill the trained RL agents, in a supervised learning fashion, into a unified policy of the student. In contrast, we follow a completely distinct objective, where a student is continually interacting with an environment and it only uses the teacher’s signals as a guideline for shaping her policy.

Closest to ours, Schmitt et al. (2018) propose “kickstarting RL,” a method that uses the teacher’s information for better training. Incorporating the idea of population-based training, they design a hand-crafted decreasing schedule of Lagrange multipliers, {λk}0. Nevertheless, the justification for such a schedule is not clearly visible. However, noticing that their problem is a special case of ours with δ=, our findings confirm the credibility of their approach, i.e., our findings indicate that λ*=limkλmink=0 according to strong duality. This observation also conforms with the experimental findings of Schmitt et al. (2018), and our theoretical results indicate that when there is no obligation on being similar to the teacher, the student is better off eventually operating independently. Similarly, their method only uses the teacher for faster learning.

Imposing certain constraints on the behavior of a policy is also a common problem in the context of “safe RL” Achiam et al. (2017); Leike et al. (2017); Chow et al. (2018). Typically, these problems look for policies that avoid hazardous states either during training or execution. Our problem is different in that we follow another type of constraint, yet similar methods might be applied. Using a domain-specific programming language instead of neural networks can be an alternative method to add interpretability Verma et al. (2018), but it lacks the numerous advantages inherent in end-to-end and differentiable learning. In an alternative direction, it is also possible to manipulate the policy shape by introducing auxiliary tasks or reward shaping Jaderberg et al. (2016). Despite the simplicity of the latter approach, it has a very limited capability. For example, it is unclear how reward shaping can suggest directions similar to our square-wave teacher. In summary, we believe that our end-to-end method, by implicitly adding interpretable components, can partially alleviate the concerns related to the RL policies.

7 Concluding Remarks

In this paper, we introduce a new paradigm called corrective RL, which allows a “student” agent to learn to optimize its own policy while also staying sufficiently close to the policy of a “teacher.” Our approach is motivated by the fact that practitioners may be reluctant to adopt the policies proposed by RL algorithms if they differ too much from the status quo. Even if the RL policy produces an impressive expected return, this may not be satisfactory evidence to switch the operation of a billion-dollar company to a policy found by an RL. We believe that corrective RL provides a straightforward remedy by constraining how far the new policy can deviate from the old one or another desired, target policy. Doing so will help reduce the stresses of adopting a novel policy.

We believe that, with further extensions, corrective RL has the potential to address to some of RL’s interpretability challenges. Using more advanced optimization algorithms, studying different distance measures, considering continuous-action problems, and having multiple teachers represent fruitful avenues for future research.


This work was partially supported by the U.S. National Science Foundation, under award numbers NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796, and XSEDE IRI180020.


  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Chen et al. [2018] Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1187–1196. ACM, 2018.
  • Nazari et al. [2018] MohammadReza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takac. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pages 9861–9871, 2018.
  • Feng et al. [2017] Shuo Feng, Peyman Setoodeh, and Simon Haykin. Smart home: Cognitive interactive people-centric internet of things. IEEE Communications Magazine, 55(2):34–39, 2017.
  • Gijsbrechts et al. [2018] Joren Gijsbrechts, Robert N Boute, Jan A Van Mieghem, and Dennis Zhang. Can deep reinforcement learning improve inventory management? performance and implementation of dual sourcing-mode problems. Performance and Implementation of Dual Sourcing-Mode Problems (December 17, 2018), 2018.
  • Oroojlooyjadid et al. [2017] Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence Snyder, and Martin Takáč. A deep q-network for the beer game with partial information. arXiv preprint arXiv:1708.05924, 2017.
  • Gu et al. [2017] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396, 2017.
  • Snyder and Shen [2019] Lawrence V Snyder and Zuo-Jun Max Shen. Fundamentals of supply chain theory, 2th Edition. John Wiley & Sons, 2019.
  • Nasrabadi [2007] Nasser M Nasrabadi. Pattern recognition and machine learning. Journal of Electronic Imaging, 16(4):049901, 2007.
  • Bertsekas [1999] Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
  • Tamar et al. [2012] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related risk criteria. In Proceedings of the twenty-ninth International Conference on Machine Learning, pages 387–396, 2012.
  • Chow et al. [2017] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–167, 2017.
  • Konda and Tsitsiklis [2000] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
  • Cramer [1946] Harold Cramer. Mathematical methods of statistics, princeton univ. Press, Princeton, NJ, 1946.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  • Schaal [1999] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
  • Thomaz et al. [2006] Andrea Lockerd Thomaz, Cynthia Breazeal, et al. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In Aaai, volume 6, pages 1000–1005. Boston, MA, 2006.
  • Rusu et al. [2015] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
  • Parisotto et al. [2015] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
  • Schmitt et al. [2018] Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835, 2018.
  • Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31. JMLR. org, 2017.
  • Leike et al. [2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
  • Chow et al. [2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
  • Verma et al. [2018] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.
  • Jaderberg et al. [2016] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Bhatnagar et al. [2009] Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11), 2009.
  • Borkar [2009] Vivek S Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
  • Bertsekas [2009] Dimitri P Bertsekas. Convex optimization theory. Athena Scientific Belmont, 2009.
  • Slotine et al. [1991] Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume 199. Prentice hall Englewood Cliffs, NJ, 1991.
  • Teh et al. [2017] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
  • Ghosh et al. [2017] Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
  • Liu et al. [2018] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5361–5371, 2018.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Bello et al. [2016] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

Appendix A PDPG Algorithms

A.1 Computing the Gradients

The Lagrangian function in the optimization problem (3) can be re-written as

L(θ,λ)= τ𝒯θ(τ)J(τ)+λτ𝒯θ(τ)logθ(τ)ϕ(τ)-λδ
= τ𝒯θ(τ)(J(τ)+λlogθ(τ)ϕ(τ))-λδ. (11)

Recall that 𝒯 is the set of all trajectories under all admissible policies. By taking the gradient of L(θ,λ) with respect to θ, we have:

θL(θ,λ) =τ𝒯θθ(τ)(J(τ)+λlogθ(τ)ϕ(τ))+θ(τ)(λθlogθ(τ))
=𝔼𝒯[θlogθ(τ)(J(τ)+λlogθ(τ)ϕ(τ)+λ)], (12)

and the term θlogθ(τ) can be simplified as

θlogθ(τ) =θ(logP0(x0)+t=0H-1logP(xt+1|xt,at)+t=0H-1logπS(at|xt;θ))
=t=0H-1θπS(at|xt;θ)πS(at|xt;θ). (13)

The gradient of L(θ,λ) with respect to λ is

λL(θ,λ) =τ𝒯θ(τ)logθ(τ)ϕ(τ)-δ=DKL(θ(τ)ϕ(τ))-δ. (14)

By using a set of sample trajectories {τj,j=1,,N} generated under the student policy, one can approximate the gradients (12) and (14) as

θL(θ,λ) 1Nj=1N[θlogθ(τj)(J(τj)+λlogθ(τj)ϕ(τj)+λ)],
λL(θ,λ) D^KL(θϕ)-δ=1Nj=1Nlogθ(τ)ϕ(τ)-δ,

which are the update rules that will be used later on, in (15) and (16).

A.2 PDPG Algorithm

Having derived the gradients of the Lagrangian (in Appendix A.1), we have all the necessary information for proposing our primal–dual policy gradient (PDPG) algorithm, which is described in Algorithm A.2.


[htbp] Primal-Dual Policy Gradient (PDPG) Algorithm for (OPT-R) \[email protected]@algorithmic[1] \STATEinput: teacher’s policy with weights ϕ \STATEinitialize: student’s policy with θ0, possibly equal to ϕ; initialize step size schedules α1() and α2() \WHILETRUE \FORk=0,1, \STATEfollowing policy θk, generate a set of N trajectories 𝒯k={τjk,j=1,2,,N}, each starting from an initial state x0P0() \STATE(θ-update) update θk according to

θk+1=ΓΘ[θk-α1(k)(1Nj=1Nθlogθ(τjk)|θ=θk(J(τjk)+λklogθ(τjk)ϕ(τjk)+λk))] (15)

(λ-update) update λk according to

λk+1=ΓΛ[λk+α2(k)(1Nj=1Nlogθ(τjk)ϕ(τjk)-δ)] (16)

λk converges to λmax \STATEλmax2λmax \ELSE \STATEreturn θ and λ; break


After initializing the student with the teacher’s policy, at each iteration k, we take a mini-batch of sample trajectories under the student’s policy θk. In step A.2, we use the sampled trajectories to compute an approximate gradient of the Lagrangian function with respect to θ and update the policy parameters in the negative direction of the approximate gradient with step size α1(k). In addition to policy parameter updates, the dual variables are learned concurrently using the recursive formula

λk+1=λk+α2(k)[D^KL(θϕ)-δ], (17)

where α2(k) represents the associated step-size rule.

In this algorithm, we need to use two projection operators to ensure the convergence of the algorithm. Specifically, ΓΘ is an operator that projects θ to the closest point in Θ, i.e., ΓΘ(θ)=argminθ^Θθ-θ^2. Similarly, ΓΛ is an operator that maps λ to the interval Λ[0,λmax]. Finally, in steps A.2A.2, we check whether λ has converged to some point on the boundary. Such a convergence means that the projection space for the Lagrange multipliers is small, so we increment the upper bound and repeat searching for a better policy.

Appendix B Convergence Analysis of PDPG for (OPT-R)

Before starting the proof of Theorem 1, noting the definition of θL(θ,λ) and λL(θ,λ), one can make the following observations:

Lemma 1.

Under Assumption 2, the following holds:

  1. 1.

    θlogθ(τ) is Lipschitz continuous in θ, which further implies that

    θlogθ(τ)2κ1(τ)(1+θ2) (18)

    for some κ1(τ)<.

  2. 2.

    θL(θ,λ) is Lipschitz continuous in θ, which further implies that

    θL(θ,λ)2κ2(1+θ2) (19)

    for some constant κ2<.

  3. 3.

    λL(θ,λ) is Lipschitz continuous in λ.


Recall from (13) that θlogθ(τ)=t=0H-1θπS(at|xt;θ)/πS(at|xt;θ) whenever πS(at|xt;θ)>ψ for all t and for some ψ>0. Assumption 2 indicates that θπS(at|xt;θ) is -Lispchitz continuous in θ. Then using the fact the sum of the product of (bounded) Lipschitz functions is Lipschitz itself, one can conclude the Lipschitz continuity of θlogθ(τ), and we denote by L1 its finite Lipschitz constant. Also, noting that H< w.p. 1, then θlogθ(τ)< w.p. 1. The Lipschitz continuity implies that for any fixed θ0Θ,

θlogθ(τ)θlogθ(τ)|θ=θ0+L1θ-θ0K1(τ)(1+θ). (20)

The first inequality follows from the linear growth condition of Lipschitz functions and the last one holds for a suitable value of K1(τ)max{L1,θlogθ(τ)|θ=θ0+L1θ0}<. Taking the square of both sides of (20) yields (18) with κ1(τ)2(K1(τ))2<.

Since θ(τ) and logθ(τ) are continuously differentiable in θ whenever θ(τ)>0, the Lipschitz continuity of θL(θ,λ) can be investigated, from its definition (12), as the sums of products of (bounded) Lipschitz functions. From the definition (12), and recalling Assumption 1 and the compactness of Θ, one can verify the validity of (19) with

κ2=𝔼τ[κ1(τ)(Cmax1-γ+λmaxmaxθΘlogθ(τ)ϕ(τ))]<. (21)

Finally, 3 immediately follows from the fact that λL(θ,λ) is a constant function of λ. ∎

B.1 Convergence of PDPG Algorithm

We use the standard procedure for proving the convergence of the PDPG algorithm. The proof steps are common for stochastic approximation methods and we refer the reader to Chow et al. [2017], Bhatnagar et al. [2009] and references therein for more details. We summarize the scheme of the proof in the following steps:

  1. 1.

    Tracking o.d.e.: Under Assumption 3, one can view the PDPG as a two-time-scale stochastic approximation method. Then, using the results of Section 6 of Borkar [2009], we show that the sequence of (θk,λk) converges almost surely to a stationary point (θ*,λ*) of the corresponding continuous-time dynamical system.

  2. 2.

    Lyapunov Stability: By using Lyapunov analysis, we show that the continuous-time system is locally asymptotically stable at a first-order stationary point.

  3. 3.

    Saddle Point Analysis: Since we have used the Lagrangian as the Lyapunov function, it implies the system is stable in the stationary point of the Lagrangian, which is, in fact, a local saddle point. Finally, we show that with an appropriate initial policy, the policy converges to a local optimal solution θ* for the OPT-R.

First, let us denote by ΨΞ[f(ξ)] the right directional derivative of ΓΞ(ξ) in the direction of f(ξ), defined as


for any compact set Ξ and ξΞ.

Since θ converges on a faster time-scale than λ by Assumption 3, one can write the θ-update rule (15) with a relation that is invariant to λ:


Consider the continuous-time dynamics of θΘ defined as

θ˙=ΨΘ[-θL(θ,λ)], (22)

where by using the right directional derivative ΨΘ[-θL(θ,λ)] in the gradient descent algorithm for θ, the gradient will point in the descent direction of L(θ,λ) along the boundary of Θ (denoted by Θ) whenever the θ-update hits the boundary. We refer the interested reader to Section 5.4 of Borkar [2009] for discussions about the existence of the limit in (22).

Since λ converges in the slowest time-scale, the λ-update rule (16) can be re-written for a converged value θ*(λ) as


Consider the continuous-time dynamics corresponding to λ, i.e.

λ˙=ΨΛ[λL(θ,λ)], (23)

where by using ΨΛ[λL(θ,λ)] in the gradient ascent algorithm, the gradient will point in the ascent direction along the boundary of Λ (denoted by Λ) whenever the λ-update hits the boundary.

We prove Theorem 1 next.


Convergence of the θ-update: First, we need to show that the assumptions of Lemma 1 in Chapter 6 of Borkar [2009] hold for the θ-update and an arbitrary value of λ. Let us justify these assumptions: (i) the Lipschitz continuity follows from Lemma 1, and (ii) the step-size rules follow from Assumption 3. (iii) For an arbitrary value λ, one can write the θ-update as a stochastic approximation, i.e.,

θk+1=ΓΘ[θk+α1(k)(-θL(θ,λ)|θ=θk+Mθk+1)], (24)


Mθk+1=θL(θ,λ)|θ=θk-1Nj=1Nθlogθ(τjk)|θ=θk(J(τjk)+λklogθ(τjk)ϕ(τjk)+λk). (25)

For Mθk+1 to be a Martingale difference error term, we need to show that its expectation with respect to the filtration θk=σ(θm,Mθm,mk) is zero and that it is square integrable with 𝔼[Mθk+12|θk]κk(1+θk2) for some κk. Since the trajectories 𝒯k are generated from the probability mass function θk(), it immediately follows that 𝔼[Mθk+1|θk]=0. Also, we have:



Dmaxk =max1jNlogθ(τjk)ϕ(τjk),and
κk =2κ2k+2NNmax1jNκ1k(τjk)(Cmax1-γ+λmax(Dmaxk+1))2<.

The first and second inequality uses the relation i=1Nai22N-1(i=1Na2). Also, the second one uses the results of Lemma 1. Finally, the boundedness of κk follows from Assumption 1 and having κ1k<, κ2k(τjk)< w.p. 1. Finally, (iv) supkθk< almost surely, because all θk are within the compact set Θ. Hence, by Theorem 2 of Chapter 2 in Borkar [2009], the sequence {θk} converges almost surely to a (possibly sample path dependent) internally chain transitive invariant set of o.d.e. (22).

For a given λ, define the Lyapunov function

λ(θ)=L(θ,λ)-L(θ*,λ), (26)

where θ*Θ is a local minimum point. For the sake of simplifying the proof, let us consider that θ* is an isolated local minimum point, i.e., there exists r such that for all θ𝔹r(θ*), λ(θ)>λ(θ*). This means that the Lyapunov function λ(θ) is locally positive definite, i.e., λ(θ*)=0 and λ(θ)>0 for 𝔹r{θ*}.

If we establish the negative semi-definiteness of dλ(θ)/dt0, then we can use the Lyapunov stability theorems to show the convergence of the dynamical system. Consider the time derivative of corresponding continuous-time system for θ, i.e.,

dλ(θ)dt=dL(θ,λ)dt=(θL(θ,λ))TΨΘ(-θL(θ,λ)). (27)

Consider two cases:

  1. 1.

    For a fixed θ0Θ, there exists α0>0 such that the update θ0-αθL(θ,λ)|θ=θ0Θ for all α(0,α0]. In this case, ΨΘ(-θL(θ,λ))=-θL(θ,λ), which further implies that


    and this quantity is non-zero as long as ΨΘ(-θL(θ,λ))0.

  2. 2.

    For fixed θ0Θ and any α0>0, there exists α(0,α0] such that θαθ0-αθL(θ,λ)|θ=θ0Θ. The projection ΓΘ(θα)=argminθΘ12θ-θα2 maps θα to a point in Θ. This projection is single-valued because of the compactness and convexity of Θ, and we denote the projected point by θ¯αΘ. Consider α0, then

    (θL(θ,λ))TΨΘ(-θL(θ,λ)) =limα0(θ-θα)T(θ¯α-θ)η

    where the last inequality follows from the Projection Theorem (see Proposition 1.1.9 of Bertsekas [2009]). Again, one can verify that the time-derivative quantity is non-zero as long as ΨΘ(-θL(θ,λ))0.

In summary, dλ(θ)/dt0 and this quantity is nonzero as long as ΨΘ(-θL(θ,λ))0. Then by LaSalle’s Local Invariant Set Theorem (see, e.g., Theorem 3.4 of Slotine et al. [1991]), we conclude that the dynamical system tends to the largest positive invariant set within 𝕄θ{θ:ΨΘ(-θL(θ,λ))=0}. Notice that θ*𝕄θ. Let l>0 be equal to min{λ(θ):ΨΘ(-θL(θ,λ))=0,θ𝔹r(θ*)θ*}. Then every trajectory starting from the attraction region {θ𝔹r(θ*)|λ(θ)<l} will tend to the local minimum θ*. Since we chose θ* to be arbitrary, this holds for all local minima. Hence, using Corollary 4 of Chapter 2 in Borkar [2009], we conclude that if the initial policy θ0 is within the attraction region of a local minimum point θ*, then it will converge to it almost surely.


The case in which θ* is not isolated can be handled similarly, with the minor difference that the convergence happens to a set of optimal points instead of to a single point.

Convergence of the λ-update: We need to show that the assumptions of Theorem 2 in Chapter 6 of Borkar [2009] hold for the two-time-scale stochastic approximation theory. Let us verify the validity of these assumptions: (i) λL(θ,λ) is a Lipschitz function in λ from Lemma 1, and (ii) step-size rules follow from Assumption 3. (iii) Since λ converges in a slower time-scale, we have θk,i-θ*(λk)0 almost surely as i, which, according to the Lipschitz continuity of λL(θ,λ), implies that

λL(θ,λ)|θ=θk,i,λ=λk-λL(θ,λ)|θ=θ*(λk),λ=λk0 as i. (28)

Hence the λ-update can be written as



Mλk+1=-λL(θ,λ)|θ=θ*(λk),λ=λk+(1Nj=1Nlogθ*(λk)(τjk)ϕ(τjk)-δ). (29)

From (29), we can verify that 𝔼[Mλk+1|λk]=0, where λk=σ(λm,Mλm,mk) is a filtration of λ generated by different independent trajectories. Also, we have:


Hence, Mλk+1 is a Martingale difference error. Also, (v) sup{λk}<. Recall that from the convergence analysis of the θ-update for a λk, we know that θ*(λk) is an asymptotically stable point. Then by Theorem 2 of Chapter 6 in Borkar [2009], we can conclude that (θk,λk) converges almost surely to (θ*(λ*),λ*), where λ* belongs to an internally chain transitive invariant set of (23).

Define the Lyapunov function:


where λ* is a local maximum point, i.e., there exists r such that for any λr(λ*), the Lyapunov function (λ) is positive definite. We can follow similar lines of arguments as we did for the θ-update to show that d(λ)dt0 and this quantity is non-zero as long as ΨΛ(-λL(θ*(λ),λ))0. Then by using the results of LaSalle’s Local Invariant Set Theorem, we can establish the convergence of the dynamical system to the largest invariant set within 𝕄λ{λ:ΨΛ(-λL(θ*(λ),λ))=0}. This means that λ*𝕄λ is a stationary point. Let l=min{(λ):ΨΛ(-λL(θ*(λ),λ))=0,λr(λ*)λ*}. Then, every trajectory starting with λ0 in {λr(λ*):(λ)<l} will tend to λ* w.p.1.

Saddle Point Analysis: By denoting θ*=θ*(λ*), we want to show that (θ*,λ*) is, in fact, a saddle point of the Lagrangian L(θ,λ). Recall that, as we proved in the convergence the of θ-update, θ* is a local minimum of L(θ,λ) within a sufficiently small ball around itself, i.e., there exists r>0 such that

L(θ*,λ*)L(θ,λ*),θΘr(θ*). (30)

It is easy to verify that θ* is a feasible solution of (OPT-R) whenever λ*[0,λmax), i.e.

DKL(θ*ϕ)δ. (31)

To show this, assume for a contradiction that DKL(θ*ϕ)-δ>0. Then,

ΨΛ[λL(θ,λ)|θ=θ*,λ=λ*] =limα0ΓΛ[λ*+αλL(θ,λ)|θ=θ*,λ=λ*]-ΓΛ[λ*]α

which contradicts the fact that ΨΛ[λL(θ,λ)|θ=θ*,λ=λ*]=0. Notice that the feasibility cannot be verified when λ*=λmax, because ΨΛ[λL(θ,λ)|θ=θ*(λmax),λ=λmax]=0 when DKL(θ*ϕ)>δ. In this case, we increase λmax (e.g., we set λmax2λmax in our algorithm) if such a behavior happens until it converges to an interior point of [0,λmax].

In addition, the complementary slackness condition

λ*(DKL(θ*ϕ)-δ)=0 (32)

holds. To show this, we only need to verify that DKL(θ*ϕ)<δ yields λ*=0. For a contradiction, suppose that λ*(0,λmax). Then, we have

ΨΛ[λL(θ,λ)|θ=θ*(λ*),λ=λ*] =DKL(θ*ϕ)-δ<0,

which contradicts the fact that ΨΛ[λL(θ,λ)|θ=θ*,λ=λ*]=0, meaning that λ*=0 in this case. Hence, we have:

L(θ*,λ*) =Vθ*(x0)+λ*(DKL(θ*ϕ)-δ)
Vθ*(x0)+λ(DKL(θ*ϕ)-δ)=L(θ*,λ). (33)

From (30) and (33), we observe that (θ*,λ*) is a saddle point of L(θ,λ), so according to the saddle point theorem, θ* is a local minimum of (OPT-R). Recall that the result of Theorem 1 depends on the initial values for θ0 and λ0, so the convergence to a local minimum is sample path depenedant. ∎

B.2 Proof of Corollary 1


From the convergence analysis of the θ-update, we know that {θk} converges almost surely to the largest invariant set within 𝕄θ, and similarly, {λk} converges almost surely to the largest invariant set within 𝕄λ. We also know from (31) that θ* is a feasible point of (OPT-R). When λ*=0, then L(θ*,λ*)=Vθ*(x0). Also, for λ*>0, the complementary slackness condition (32) implies DKL(θ*ϕ)=δ. Hence θDKL(θϕ)|θ=θ*=0, which in turn, means that

θL(θ,λ*)|θ=θ*=θVθ(x0)|θ=θ*+λ*θDKL(θϕ)|θ=θ*=θVθ(x0)|θ=θ*. (34)

Hence, for a θ* located in the interior of Θ, we have θL(θ,λ*)|θ=θ*=θVθ(x0)|θ=θ*=0, so it is a first-order stationary point of (OPT-R). However, if θ*Θ, it is possible to have θL(θ,λ*)|θ=θ*0. ∎


In practice, we choose the projection set Θ large enough so that the latter case (convergence to boundary) will not happen. For example, assuring that the weights of a neural network do not diverge is a sufficient criterion to use instead of the projection operator ΓΘ.

B.3 Equivalent Results for (OPT-F)

A similar PDPG algorithm to the one proposed in Algorithm A.2 can solve (OPT-F), only requiring a slight modification of rules (15) and (16) as

θk+1 =ΓΘ[θk-α1(k)(1Nj=1Nθlogθ(τjk)|θ=θk(J(τjk)+λkIS(τjk)logϕ(τjk)θ(τjk)-λk))]
λk+1 =ΓΛ[λk+α2(k)(1Nj=1NIS(τjk)logϕ(τjk)θ(τjk)-δ)],

where IS(τjk)=ϕ(τjk)/θ(τjk) is the importance sampling weight added to account for the bias introduced by sampling under the student’s policy. To ensure a well-defined (OPT-F), we need the following assumption:

Assumption 4.

Well-defined (OPT-F): for any state–action pair (x,a)X×A with πS(x,a)=0, we have πT(x,a)=0.

This assumption ensures a similar criterion to that of Assumption 1, but notice that in this case, the student might take any action, regardless of the teacher’s policy. Exactly the same steps can be taken, virtually verbatim, to prove the following convergence property of the PDPG algorithm for (OPT-F).

Theorem 2.

Under Assumptions 2, 3, and 4, the sequence of policy updates (starting from θ0 sufficiently close to a local optimum point θ*) and Lagrange multipliers converges almost surely to a saddle point of the Lagrangian, i.e., (θ(k),λ(k))a.s.(θ*,λ*). Then, θ* is the local optimal solution of (OPT-F).

Appendix C Practical PDPG Algorithm

A naive implementation of Algorithm A.2 would result in a high-variance training procedure. In this section, we discuss several techniques for variance reduction, resulting in a more stable algorithm compared to the one proposed in Algorithm A.2.

C.1 Step-wise KL-divergence Measure

In the policy distillation literature, some studies use a trajectory-wise KL-divergence (KL-F) as the distance metric Teh et al. [2017], but the step-wise KL-divergence between the distribution is also common Ghosh et al. [2017], which is defined as:

DKLstep(ϕθ)=𝔼xdπT[DKL(πT(|x;ϕ)πS(|x;θ))], (35)


DKL(πT(|x;ϕ)πS(|x;θ))=a𝒜πT(a|x;ϕ)logπT(a|x;ϕ)πS(a|x;θ). (36)

In the next proposition, we explore the relations between these two methods.

Proposition 1.

The following relation holds between the trajectory-wise and step-wise KL-divergence metrics:

DKL(ϕθ)𝔼[H]DKLstep(ϕθ) (37)

According to the definition of trajectory-wise KL-divergence, we have:

DKL(ϕ(τ)||θ(τ)) =τϕ(τ)logϕ(τ)θ(τ)

Here, 𝕀t(τ;x,a) is the indicator of whether (xt=x,at=a) occurs along trajectory τ. Also, dπT(x) is the distribution of being in state x under policy πT, defined as


and dt,πT(x) is the probability of being in x at time t under policy πT. ∎

According to this proposition, the step-wise KL distances can be used to provide an upper bound on the trajectory-wise one. In other words, if the step-wise KL multiplied by the expected horizon length are less than δ, then it is also correct for the trajectory-wise one.