Constrained Variational Policy Optimization for Safe Reinforcement Learning

Abstract

Safe reinforcement learning (RL) aims to learn policies that satisfy certain constraints before deploying them to safety-critical applications. Previous primal-dual style approaches suffer from instability issues and lack optimality guarantees. This paper overcomes these issues from the perspective of probabilistic inference. We introduce a novel Expectation-Maximization approach to naturally incorporate constraints during policy learning: 1) a provably optimal non-parametric variational distribution can be computed in closed form after a convex optimization (E-step); 2) the policy parameter is improved within the trust region based on the optimal variational distribution (M-step). The proposed algorithm decomposes the safe RL problem into a convex optimization phase and a supervised learning phase, which yields more stable training. A wide range of experiments on continuous robotic tasks shows that the proposed method achieves significantly better constraint satisfaction and better sample efficiency than baselines. The code is available at https://github.com/liuzuxin/cvpo-safe-rl.
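To make the two-phase decomposition concrete, the following is a minimal single-state sketch, not the authors' implementation. It rests on several assumptions that the abstract does not state: actions are sampled from the current policy (so the prior over samples is uniform), the E-step objective is expected reward subject to a KL bound `eps_kl` and a cost bound `cost_limit`, the dual is minimized by a coarse grid search rather than a proper convex solver, and the policy is a one-dimensional Gaussian with fixed standard deviation `sigma`. All function and parameter names are hypothetical.

```python
import math

def closed_form_q(q_r, q_c, eta, lam):
    """Weights over sampled actions, proportional to exp((Qr - lam * Qc) / eta)."""
    logits = [(r - lam * c) / eta for r, c in zip(q_r, q_c)]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [x / z for x in w]

def dual(q_r, q_c, eta, lam, eps_kl, cost_limit):
    """Lagrangian dual g(eta, lam), convex in (eta, lam); uniform prior assumed."""
    logits = [(r - lam * c) / eta for r, c in zip(q_r, q_c)]
    m = max(logits)
    log_mean_exp = m + math.log(sum(math.exp(l - m) for l in logits) / len(logits))
    return eta * eps_kl + lam * cost_limit + eta * log_mean_exp

def e_step(q_r, q_c, eps_kl=0.1, cost_limit=0.5):
    """Assumed E-step: minimize the dual on a grid, then plug into the closed form."""
    etas = [0.05 * k for k in range(1, 101)]   # eta > 0
    lams = [0.1 * k for k in range(0, 101)]    # lam >= 0
    eta, lam = min(
        ((e, l) for e in etas for l in lams),
        key=lambda p: dual(q_r, q_c, p[0], p[1], eps_kl, cost_limit),
    )
    return closed_form_q(q_r, q_c, eta, lam), eta, lam

def m_step(actions, weights, old_mean, sigma=1.0, eps_kl=0.01):
    """Assumed M-step: weighted maximum-likelihood mean, clipped to a KL trust region.

    For a fixed-variance Gaussian, KL(old || new) = shift**2 / (2 * sigma**2),
    so the KL bound becomes a bound on the absolute shift of the mean.
    """
    target = sum(w * a for w, a in zip(weights, actions))
    max_shift = sigma * math.sqrt(2.0 * eps_kl)
    shift = max(-max_shift, min(max_shift, target - old_mean))
    return old_mean + shift

if __name__ == "__main__":
    actions = [-1.0, 0.0, 1.0, 2.0]       # hypothetical sampled actions
    reward_q = [0.2, 0.5, 1.0, 1.5]       # hypothetical reward critic values
    cost_q = [0.0, 0.1, 0.6, 1.2]         # hypothetical cost critic values
    q, eta, lam = e_step(reward_q, cost_q)
    print(q, eta, lam, m_step(actions, q, old_mean=0.0))
```

The E-step here involves no policy parameters at all, and the M-step is a weighted regression toward the E-step distribution, which is the separation into a convex optimization phase and a supervised learning phase that the abstract describes.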
