Penalized Proximal Policy Optimization for Safe Reinforcement Learning

Abstract

Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle to perform efficient policy updates while satisfying hard constraints. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O uses a simple yet effective penalty function to eliminate cost constraints and replaces the trust-region constraint with the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis of the approximation error when it is evaluated on sample trajectories. Moreover, we extend P3O to the more challenging multi-constraint and multi-agent scenarios, which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
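
The abstract does not state the objective itself, so the following is only a minimal sketch of what a penalized, clipped surrogate loss of this kind could look like, not the paper's exact formulation. It assumes a PPO-style clipped surrogate for the reward, a pessimistic clipped surrogate for the cost, and a ReLU-style penalty weighted by a finite factor. The function name `penalized_clipped_loss` and the parameters `ratio`, `cost_excess`, `kappa`, and `eps` are illustrative assumptions.

```python
import numpy as np


def penalized_clipped_loss(ratio, adv_reward, adv_cost, cost_excess,
                           kappa=20.0, eps=0.2):
    """Illustrative penalized clipped surrogate loss (to be minimized).

    ratio       : new-policy / old-policy probability ratios per sample
    adv_reward  : reward advantage estimates per sample
    adv_cost    : cost advantage estimates per sample
    cost_excess : (assumed) current expected cost minus the cost limit
    kappa       : (assumed) finite penalty factor
    eps         : (assumed) clipping range, as in PPO
    """
    ratio = np.asarray(ratio, dtype=float)
    adv_reward = np.asarray(adv_reward, dtype=float)
    adv_cost = np.asarray(adv_cost, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)

    # Clipped surrogate for the reward, negated so that lower is better.
    loss_reward = -np.mean(np.minimum(ratio * adv_reward, clipped * adv_reward))

    # Pessimistic (max) clipped surrogate for the cost, shifted by how far
    # the current policy is from the cost limit.
    surrogate_cost = np.mean(np.maximum(ratio * adv_cost, clipped * adv_cost))
    surrogate_cost += cost_excess

    # ReLU-style penalty: contributes only when the surrogate predicts
    # a constraint violation.
    return loss_reward + kappa * max(0.0, surrogate_cost)
```

In this sketch the penalty term vanishes whenever the cost surrogate is non-positive, so the update reduces to an ordinary clipped reward step while the constraint is predicted to hold.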
