Abstract
Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered. However, constrained reinforcement learning methods struggle to strike the right balance between task performance and constraint satisfaction, and they are prone to getting stuck in over-conservative or constraint-violating local minima. In this paper, we propose Adversarial Constrained Policy Optimization (ACPO), which enables simultaneous optimization of the reward and adaptation of the cost budgets during training. Our approach divides the original constrained problem into two adversarial stages that are solved alternately, and the policy update performance of our algorithm can be theoretically guaranteed. We validate our method through experiments on Safety Gymnasium and quadruped locomotion tasks. Results demonstrate that our algorithm achieves better performance than commonly used baselines.