Safe Reinforcement Learning Using Advantage-Based Intervention

Abstract

Many sequential decision problems involve finding a policy that maximizes total reward while obeying safety constraints. Although much recent research has focused on the development of safe reinforcement learning (RL) algorithms that produce a safe policy after training, ensuring safety during training as well remains an open problem. A fundamental challenge is performing exploration while still satisfying constraints in an unknown Markov decision process (MDP). In this work, we address this problem for the chance-constrained setting. We propose a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training and optimizes the agent's policy using off-the-shelf RL algorithms designed for unconstrained MDPs. Our method comes with strong guarantees on safety during both training and deployment (i.e., after training and without the intervention mechanism) and policy performance compared to the optimal safety-constrained policy. In our experiments, we show that SAILR violates constraints far less during training than standard safe RL and constrained MDP approaches and converges to a well-performing policy that can be deployed safely without intervention. Our code is available at https://github.com/nolanwagener/safe_rl.
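
The abstract does not spell out the intervention rule, so the following is only a minimal sketch of what an advantage-based intervention check could look like. It assumes that a backup policy comes with risk estimates `q_backup(state, action)` and `v_backup(state)`, and that the agent's proposed action is overridden when its advantage under those estimates exceeds a threshold. The function names, the threshold, and the toy one-dimensional environment are all hypothetical and are not taken from the paper or its code repository.

```python
from typing import Callable

UNSAFE_POSITION = 4  # hypothetical: positions >= 4 count as constraint violations


def v_backup(state: int) -> float:
    """Toy risk estimate for a backup policy that always steps left.

    Assumption: the risk is 1.0 if the state is already unsafe, else 0.0.
    """
    return 1.0 if state >= UNSAFE_POSITION else 0.0


def q_backup(state: int, action: int) -> float:
    """Toy risk of taking `action` once, then following the backup policy.

    Assumption: dynamics are deterministic, next_state = state + action.
    """
    return v_backup(state + action)


def should_intervene(
    q: Callable[[int, int], float],
    v: Callable[[int], float],
    state: int,
    action: int,
    threshold: float,
) -> bool:
    """Return True when the action's risk advantage exceeds the threshold."""
    advantage = q(state, action) - v(state)
    return advantage > threshold


# Stepping right from position 3 reaches the unsafe region, so it is blocked.
blocked = should_intervene(q_backup, v_backup, state=3, action=+1, threshold=0.5)
# Stepping left from position 3 carries no added risk, so it is allowed.
allowed = not should_intervene(q_backup, v_backup, state=3, action=-1, threshold=0.5)
```

In this sketch the check is a pure function of the risk estimates, so it could wrap any unconstrained RL agent without changing the agent's own update rule, which is in the spirit of the abstract's use of off-the-shelf RL algorithms.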
