Abstract
We consider the problem of reinforcement learning when provided with a baseline control policy and a set of constraints that the controlled system must satisfy. The baseline policy might arise from a heuristic, a prior application, a teacher, or demonstrator data. The constraints might encode safety, fairness, or some application-specific requirements. We want to efficiently use reinforcement learning to adapt the baseline policy to improve performance and satisfy the given constraints when it is applied to the new system. The key challenge is to effectively use the baseline policy (which need not satisfy the current constraints) to aid the learning of a constraint-satisfying policy in the new application. We propose an iterative algorithm for solving this problem. Each iteration is composed of three steps. The first step performs a policy update to increase the expected reward, the second step performs a projection to minimize the distance between the current policy and the baseline policy, and the last step performs a projection onto the set of policies that satisfy the constraints. This procedure allows the learning process to leverage the baseline policy to achieve faster learning while improving reward performance and satisfying the constraints imposed on the current problem. We analyze the convergence of the proposed algorithm and provide a finite-sample guarantee. Empirical results demonstrate that the algorithm can achieve superior performance, with 10 times fewer constraint violations and around 40% higher reward compared to state-of-the-art methods.
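To make the three-step structure concrete, the following is a minimal toy sketch and not the algorithm analyzed in the paper. It assumes the policy is a plain parameter vector, distances are Euclidean, the baseline step is a projection onto a ball of hypothetical radius `radius` around the baseline parameters, and the constraint is a single linear inequality `a @ theta <= b`. All names (`three_step_update`, `project_ball`, `project_halfspace`, `reward_grad`) and the toy problem are illustrative assumptions.

```python
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection onto {x : ||x - center|| <= radius}."""
    diff = theta - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return theta
    return center + diff * (radius / norm)

def project_halfspace(theta, a, b):
    """Euclidean projection onto {x : a @ x <= b}."""
    violation = a @ theta - b
    if violation <= 0:
        return theta
    return theta - (violation / (a @ a)) * a

def three_step_update(theta, reward_grad, theta_base, radius, a, b, lr=0.1):
    # Step 1: gradient ascent step to increase the (toy) reward.
    theta = theta + lr * reward_grad(theta)
    # Step 2: stay close to the baseline (assumed here: a ball around it).
    theta = project_ball(theta, theta_base, radius)
    # Step 3: project onto the constraint set (assumed here: a half-space).
    theta = project_halfspace(theta, a, b)
    return theta

# Toy problem (hypothetical): reward = -||theta - target||^2.
target = np.array([2.0, 2.0])
reward_grad = lambda th: -2.0 * (th - target)

# The baseline deliberately violates the constraint theta[0] <= 1.
theta_base = np.array([3.0, 0.0])
a, b = np.array([1.0, 0.0]), 1.0

theta = theta_base.copy()
for _ in range(200):
    theta = three_step_update(theta, reward_grad, theta_base,
                              radius=3.0, a=a, b=b)

print(theta)
```

In this toy setting the iterate starts at a constraint-violating baseline, is pulled toward the reward maximizer, and ends on the constraint boundary. Because the constraint projection is applied last, every iterate returned by `three_step_update` satisfies the linear constraint, although in general it may leave the ball around the baseline.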