Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning

Abstract

Constrained Reinforcement Learning (CRL) tackles sequential decision-making problems where agents are required to achieve goals by maximizing the expected return while meeting domain-specific constraints, which are often formulated as expected costs. In this setting, policy-based methods are widely used since they come with several advantages when dealing with continuous-control problems. These methods search in the policy space with an action-based or parameter-based exploration strategy, depending on whether they directly learn the parameters of a stochastic policy or those of a stochastic hyperpolicy. In this paper, we propose a general framework for addressing CRL problems via gradient-based primal-dual algorithms, relying on an alternating ascent/descent scheme with dual-variable regularization. We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions, improving and generalizing existing results. Then, we design C-PGAE and C-PGPE, the action-based and the parameter-based versions of C-PG, respectively, and we illustrate how they naturally extend to constraints defined in terms of risk measures over the costs, as is often required in safety-critical scenarios. Finally, we numerically validate our algorithms on constrained control problems and compare them with state-of-the-art baselines, demonstrating their effectiveness.
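To make the idea of an alternating ascent/descent scheme with dual-variable regularization concrete, here is a minimal sketch on a toy deterministic problem. It is not the C-PG algorithm itself: the scalar parameter `theta`, the reward `-(theta - 2)^2`, the cost `theta^2`, the threshold `1`, the regularization weight `omega`, and the step sizes are all assumptions chosen for illustration, and exact gradients stand in for the policy-gradient estimates an RL method would use.

```python
def regularized_primal_dual(omega=0.01, lr_theta=0.05, lr_lambda=0.05, steps=5000):
    """Toy alternating primal-dual iteration (illustrative, not the paper's C-PG).

    Hypothetical problem: maximize R(theta) = -(theta - 2)^2
    subject to C(theta) = theta^2 <= 1.
    Regularized Lagrangian (assumed form):
        L(theta, lam) = R(theta) - lam * (C(theta) - 1) + (omega / 2) * lam^2
    The primal variable ascends L, the dual variable descends L.
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal ascent step: dL/dtheta = R'(theta) - lam * C'(theta).
        grad_theta = -2.0 * (theta - 2.0) - lam * 2.0 * theta
        theta += lr_theta * grad_theta

        # Dual descent step: dL/dlam = -(C(theta) - 1) + omega * lam,
        # followed by projection onto lam >= 0.
        grad_lam = -(theta ** 2 - 1.0) + omega * lam
        lam = max(0.0, lam - lr_lambda * grad_lam)
    return theta, lam


if __name__ == "__main__":
    theta, lam = regularized_primal_dual()
    print(f"theta = {theta:.4f}, lambda = {lam:.4f}, violation = {theta**2 - 1.0:.4f}")
```

In this toy setting the unregularized constrained optimum is `theta = 1`. The regularization term keeps the dual variable bounded and makes the iteration settle at a single point, at the price of a small constraint violation that shrinks with `omega`.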
