Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization

  • 2021-10-19 17:21:09
  • Yuhao Ding, Junzi Zhang, Javad Lavaei

Abstract

Entropy regularization is an efficient technique for encouraging exploration and preventing a premature convergence of (vanilla) policy gradient methods in reinforcement learning (RL). However, the theoretical understanding of entropy regularized RL algorithms has been limited. In this paper, we revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization, whose convergence has so far only been established assuming access to exact gradient oracles. To go beyond this scenario, we propose the first set of (nearly) unbiased stochastic policy gradient estimators with trajectory-level entropy regularization, with one being an unbiased visitation measure-based estimator and the other one being a nearly unbiased yet more practical trajectory-based estimator. We prove that although the estimators themselves are unbounded in general due to the additional logarithmic policy rewards introduced by the entropy term, the variances are uniformly bounded. This enables the development of the first set of convergence results for stochastic entropy regularized policy gradient methods to both stationary points and globally optimal policies. We also develop some improved sample complexity results under a good initialization.
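
To make the trajectory-based idea concrete, the following is a minimal sketch of a generic REINFORCE-style gradient estimate for a tabular soft-max policy, in which the per-step reward is augmented with the logarithmic policy term -tau * log pi(a|s). It is an illustration under stated assumptions, not the paper's exact estimator: the function names, the random toy MDP, the fixed truncation horizon, and all parameter values are hypothetical. The fixed truncation is what makes such an estimate only nearly unbiased for the infinite-horizon discounted objective.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise soft-max: pi[s, a] is proportional to exp(theta[s, a])."""
    z = theta - theta.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def trajectory_gradient_estimate(theta, P, r, rho, gamma, tau, horizon, rng):
    """One-trajectory gradient estimate with entropy-regularized rewards.

    Assumptions (hypothetical, for illustration only): tabular MDP with
    transition tensor P[s, a, s'], reward table r[s, a], initial state
    distribution rho, and a fixed truncation horizon.
    """
    n_states, n_actions = theta.shape
    pi = softmax_policy(theta)

    # Roll out one truncated trajectory under the current policy.
    s = rng.choice(n_states, p=rho)
    states, actions, reg_rewards = [], [], []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        states.append(s)
        actions.append(a)
        # Reward plus the logarithmic policy term from the entropy bonus.
        # This term is unbounded as pi[s, a] approaches 0.
        reg_rewards.append(r[s, a] - tau * np.log(pi[s, a]))
        s = rng.choice(n_states, p=P[s, a])

    # Accumulate score-function terms weighted by discounted reward-to-go.
    grad = np.zeros_like(theta)
    reward_to_go = 0.0
    for t in reversed(range(horizon)):
        reward_to_go = reg_rewards[t] + gamma * reward_to_go
        # Gradient of log pi(a_t | s_t) w.r.t. theta[s_t, :] is e_{a_t} - pi(. | s_t).
        score = -pi[states[t]].copy()
        score[actions[t]] += 1.0
        grad[states[t]] += (gamma ** t) * reward_to_go * score
    return grad

# Toy usage on a random 3-state, 2-action MDP (all values are made up).
rng = np.random.default_rng(0)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
r = rng.uniform(size=(S, A))
rho = np.full(S, 1.0 / S)
theta = np.zeros((S, A))
g = trajectory_gradient_estimate(theta, P, r, rho, gamma=0.9, tau=0.1,
                                 horizon=50, rng=rng)
```

In a stochastic policy gradient loop, estimates like `g` would be averaged over a batch of trajectories and used in an ascent step such as `theta += step_size * g_mean`. The batch size and step size are further choices this sketch does not fix.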

 
