Don't Forget Your Teacher: A Corrective Reinforcement Learning Framework

Abstract

Although reinforcement learning (RL) can provide reliable solutions in many settings, practitioners are often wary of the discrepancies between the RL solution and their status quo procedures. Therefore, they may be reluctant to adapt to the novel way of executing tasks proposed by RL. On the other hand, many real-world problems require relatively small adjustments from the status quo policies to achieve improved performance. Therefore, we propose a student-teacher RL mechanism in which the RL (the "student") learns to maximize its reward, subject to a constraint that bounds the difference between the RL policy and the "teacher" policy. The teacher can be another RL policy (e.g., trained under a slightly different setting), the status quo policy, or any other exogenous policy. We formulate this problem using a stochastic optimization model and solve it using a primal-dual policy gradient algorithm. We prove that the policy is asymptotically optimal. However, a naive implementation suffers from high variance and convergence to a stochastic optimal policy. With a few practical adjustments to address these issues, our numerical experiments confirm the effectiveness of our proposed method in multiple GridWorld scenarios.
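To make the student-teacher idea concrete, the following is a minimal sketch of a primal-dual update on a toy problem. It is not the algorithm from the paper: it assumes a single-state (bandit) setting with a softmax student policy, uses the KL divergence from the teacher as the "difference" measure, and computes exact gradients rather than sampled policy gradients, so it sidesteps the variance issue mentioned above. All names (`primal_dual`, `delta`, `lr_theta`, `lr_lam`) and the numeric values are hypothetical choices for illustration.

```python
import math

def softmax(theta):
    # Numerically stable softmax over a list of logits.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    # KL(p || q) for discrete distributions with q > 0 everywhere.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def expected_reward(p, rewards):
    return sum(pi * r for pi, r in zip(p, rewards))

def primal_dual(rewards, teacher, delta, steps=10000, lr_theta=0.2, lr_lam=0.02):
    """Maximize expected reward subject to KL(student || teacher) <= delta.

    Assumption: the constraint is a KL bound; the paper's abstract only says
    the difference between student and teacher is bounded.
    """
    theta = [math.log(q) for q in teacher]  # start the student at the teacher
    lam = 0.0                               # Lagrange multiplier (dual variable)
    for _ in range(steps):
        pi = softmax(theta)
        j = expected_reward(pi, rewards)
        g = [math.log(p / q) for p, q in zip(pi, teacher)]
        g_bar = sum(p * gi for p, gi in zip(pi, g))  # equals KL(pi || teacher)
        # Exact gradient of the Lagrangian J - lam * (KL - delta) w.r.t. logits.
        grad = [p * ((r - j) - lam * (gi - g_bar))
                for p, r, gi in zip(pi, rewards, g)]
        theta = [t + lr_theta * d for t, d in zip(theta, grad)]  # primal ascent
        lam = max(0.0, lam + lr_lam * (g_bar - delta))           # projected dual step
    return softmax(theta), lam

rewards = [1.0, 0.0, 0.5]   # hypothetical per-action rewards
teacher = [0.2, 0.6, 0.2]   # hypothetical status quo policy
student, lam = primal_dual(rewards, teacher, delta=0.05)
print("student policy:", [round(p, 3) for p in student])
print("reward: teacher %.3f, student %.3f" % (
    expected_reward(teacher, rewards), expected_reward(student, rewards)))
print("KL(student || teacher): %.4f, multiplier: %.3f" % (kl(student, teacher), lam))
```

The multiplier rises whenever the student drifts further from the teacher than `delta` allows, which pulls the next primal step back toward the teacher. In a sequential setting such as GridWorld, the exact gradient would be replaced by a sampled policy-gradient estimate.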
