Abstract
In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require placing constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems, including discrete MDPs and continuous control. In the process, we establish a benchmark of challenging CRL problems.