Abstract
The constrained Markov decision process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method, which alternates between updating parameters in the primal and dual spaces. Existing methods for CMDPs use only on-policy data for dual updates, which results in sample inefficiency and slow convergence. In this paper, we propose a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in the primal space with an on-policy likelihood ratio gradient. Experimental results on a simulated robot locomotion task show that APDO achieves better sample efficiency and faster convergence than state-of-the-art approaches for CMDPs.
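
For orientation, a generic primal-dual scheme for a CMDP with a single cost constraint can be sketched as follows. The notation is assumed for illustration and is not necessarily that of the paper: $J_R(\pi_\theta)$ and $J_C(\pi_\theta)$ denote the expected long-term reward and cost of a policy $\pi_\theta$, $d$ is the cost limit, $\lambda \ge 0$ is the dual variable, and $\alpha, \beta > 0$ are step sizes.

\[
\max_{\theta} \; J_R(\pi_\theta) \quad \text{subject to} \quad J_C(\pi_\theta) \le d
\]

\[
L(\theta, \lambda) = J_R(\pi_\theta) - \lambda \left( J_C(\pi_\theta) - d \right)
\]

\[
\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta L(\theta_k, \lambda_k), \qquad
\lambda_{k+1} = \max\!\left( 0, \; \lambda_k + \beta \left( J_C(\pi_{\theta_{k+1}}) - d \right) \right)
\]

In this sketch the dual step raises $\lambda$ when the current policy violates the constraint and lowers it toward zero otherwise. Estimating $J_C$ purely from on-policy rollouts is the step that the abstract identifies as the source of sample inefficiency.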