On the Global Convergence of Momentum-based Policy Gradient

Abstract

Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG methods has led to the development of a theoretical foundation for these methods. In this work, we generalize this line of research by studying the global convergence of stochastic PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving PG methods. We study both the soft-max and the Fisher-non-degenerate policy parametrizations, and show that adding a momentum term improves the global optimality sample complexity of vanilla PG methods by $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ and $\tilde{\mathcal{O}}(\epsilon^{-1})$, respectively, where $\epsilon>0$ is the target tolerance. Our work is the first to obtain global convergence results for momentum-based PG methods. For the generic Fisher-non-degenerate policy parametrizations, ours is the first single-loop and finite-batch PG algorithm to achieve $\tilde{\mathcal{O}}(\epsilon^{-3})$ global optimality sample complexity. Finally, as a by-product, our methods also provide a general framework for analyzing the global convergence rates of stochastic PG methods, which can be easily applied and extended to different PG estimators.
