Global Convergence of the ODE Limit for Online Actor-Critic Algorithms in Reinforcement Learning

Abstract

Actor-critic algorithms are widely used in reinforcement learning, but are challenging to analyse mathematically due to the online arrival of non-i.i.d. data samples. The distribution of the data samples changes dynamically as the model is updated, introducing a complex feedback loop between the data distribution and the reinforcement learning algorithm. We prove that, under a time rescaling, the online actor-critic algorithm with tabular parametrization converges to an ordinary differential equation (ODE) as the number of updates becomes large. The proof first establishes the geometric ergodicity of the data samples under a fixed actor policy. Then, using a Poisson equation, we prove that the fluctuations of the data samples around a dynamic probability measure, which is a function of the evolving actor model, vanish as the number of updates becomes large. Once the ODE limit has been derived, we study its convergence properties using a two time-scale analysis which asymptotically decouples the critic ODE from the actor ODE. We prove that the critic converges to the solution of the Bellman equation and that the actor converges to the optimal policy. In addition, a convergence rate to this global minimum is established. Our convergence analysis holds under specific choices for the learning rates and exploration rates in the actor-critic algorithm, which could provide guidance for the implementation of actor-critic algorithms in practice.
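
For readers unfamiliar with the setting, the following is a minimal sketch of a generic online actor-critic loop with tabular parametrization. It is not the paper's algorithm: the toy two-state MDP (P, R), the function name run_actor_critic, the SARSA-style critic update, and the particular learning-rate and exploration-rate schedules are all assumptions chosen for illustration. It only shows the feedback loop described above, in which each sample is drawn under the current actor, so the data distribution shifts as the actor is updated.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    z = np.exp(x - x.max())
    return z / z.sum()


def run_actor_critic(num_steps=5000, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = 2, 2

    # Hypothetical toy MDP (an assumption, not from the paper):
    # P[s, a] is the next-state distribution, R[s, a] the reward.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.7, 0.3], [0.1, 0.9]]])
    R = np.array([[0.0, 1.0],
                  [0.5, 2.0]])

    Q = np.zeros((n_states, n_actions))      # tabular critic
    theta = np.zeros((n_states, n_actions))  # tabular actor (softmax logits)

    def behaviour(s, eps):
        # Softmax actor mixed with uniform exploration of weight eps.
        return (1.0 - eps) * softmax(theta[s]) + eps / n_actions

    s = 0
    a = rng.choice(n_actions, p=behaviour(s, 1.0))
    for k in range(num_steps):
        # Illustrative schedules only; the paper's specific choices differ.
        alpha = 1.0 / (1.0 + k) ** 0.6   # critic learning rate (faster scale)
        beta = 1.0 / (1.0 + k) ** 0.9    # actor learning rate (slower scale)
        eps = 1.0 / (1.0 + k / 500.0)    # decaying exploration rate

        # One online, non-i.i.d. sample drawn under the current actor.
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = rng.choice(n_actions, p=behaviour(s_next, eps))

        # Critic: temporal-difference step toward the Bellman target.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

        # Actor: policy-gradient step using the critic's estimate.
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0
        theta[s] += beta * Q[s, a] * grad_log_pi

        s, a = s_next, a_next

    policy = np.array([softmax(theta[i]) for i in range(n_states)])
    return Q, policy


if __name__ == "__main__":
    Q, policy = run_actor_critic()
    print("critic table:\n", Q)
    print("actor policy:\n", policy)
```

In this sketch the actor step size decays faster than the critic step size, which is one common way to separate the two time scales; whether it matches the conditions required by the analysis is not claimed here.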
