Abstract
We study the safe reinforcement learning problem with nonlinear functionapproximation, where policy optimization is formulated as a constrainedoptimization problem with both the objective and the constraint being nonconvexfunctions. For such a problem, we construct a sequence of surrogate convexconstrained optimization problems by replacing the nonconvex functions locallywith convex quadratic functions obtained from policy gradient estimators. Weprove that the solutions to these surrogate problems converge to a stationarypoint of the original nonconvex problem. Furthermore, to extend our theoreticalresults, we apply our algorithm to examples of optimal control and multi-agentreinforcement learning with safety constraints.
Quick Read (beta)
loading the full paper ...