Abstract
Deep Actor-Critic algorithms, which combine Actor-Critic with deep neural networks (DNNs), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, existing deep Actor-Critic algorithms are not yet mature enough to solve realistic problems with non-convex stochastic constraints and a high cost of interacting with the environment. In this paper, we propose a single-loop deep Actor-Critic (SLDAC) algorithmic framework for general constrained reinforcement learning (CRL) problems. In the actor step, the constrained stochastic successive convex approximation (CSSCA) method is applied to handle the non-convex stochastic objective and constraints. In the critic step, the critic DNNs are updated only once or a finite number of times per iteration, which simplifies the algorithm to a single-loop framework (existing works require a sufficient number of critic updates in each iteration to ensure adequate convergence of the inner loop). Moreover, the variance of the policy gradient estimate is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimate incurred by the single-loop design and observation reuse, we prove that SLDAC with a feasible initial point converges to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm achieves superior performance with much lower interaction cost.
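The single-loop idea can be illustrated with a minimal sketch on a toy scalar problem. This is not the SLDAC algorithm itself: it omits the constraints, the DNNs, and the CSSCA surrogate problem. The function name `single_loop_sketch`, the toy objective, and the step-size exponents are all assumptions made for illustration, and the recursive averaging of gradient estimates is used here only as a simple stand-in for observation reuse. What the sketch does show is the schedule: one critic update and one actor update per iteration, with the actor using a gradient estimate that is biased because the critic has not yet converged.

```python
import random


def single_loop_sketch(num_iters=20000, seed=0):
    """Toy single-loop actor-critic-style iteration (illustrative only).

    Hypothetical problem: minimize J(theta) = 0.5 * E[(theta - xi)^2] with
    xi ~ N(1, 1), whose gradient is theta - E[xi]. The "critic" is a scalar
    estimate of E[xi]; the "actor" is the scalar theta.
    """
    rng = random.Random(seed)
    theta = 5.0       # actor parameter
    critic = 0.0      # critic estimate of E[xi]
    grad_avg = 0.0    # running average of past gradient estimates

    for t in range(1, num_iters + 1):
        xi = rng.gauss(1.0, 1.0)  # one new observation per iteration

        # Critic step: a single stochastic-approximation update, no inner loop.
        beta = t ** -0.6
        critic += beta * (xi - critic)

        # Gradient estimate built from the not-yet-converged critic (biased).
        grad_now = theta - critic

        # Reuse of old information: blend the new estimate with past ones.
        rho = t ** -0.6
        grad_avg = (1.0 - rho) * grad_avg + rho * grad_now

        # Actor step: one update on a slower time scale (gamma decays faster).
        gamma = t ** -0.8
        theta -= gamma * grad_avg

    return theta, critic


if __name__ == "__main__":
    theta, critic = single_loop_sketch()
    print(f"theta = {theta:.3f}, critic = {critic:.3f}")
```

In this toy setting both `theta` and `critic` drift toward the minimizer at 1, even though the critic is never run to convergence within an iteration. The actor step size decays faster than the critic and averaging step sizes, so the critic error and the averaging lag shrink relative to the actor's movement.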