Softmax Policy Gradient Methods Can Take Exponential Time to Converge

Abstract

The softmax policy gradient (PG) method, which performs gradient ascent undersoftmax policy parameterization, is arguably one of the de factoimplementations of policy optimization in modern reinforcement learning. For$\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs),remarkable progress has recently been achieved towards establishing globalconvergence of softmax PG methods in finding a near-optimal policy. However,prior results fall short of delineating clear dependencies of convergence rateson salient parameters such as the cardinality of the state space $\mathcal{S}$and the effective horizon $\frac{1}{1-\gamma}$, both of which could beexcessively large. In this paper, we deliver a pessimistic message regardingthe iteration complexity of softmax PG methods, despite assuming access toexact gradient computation. Specifically, we demonstrate that softmax PGmethods can take exponential time -- in terms of $|\mathcal{S}|$ and$\frac{1}{1-\gamma}$ -- to converge, even in the presence of a benign policyinitialization and an initial state distribution amenable to exploration. Thisis accomplished by characterizing the algorithmic dynamics over acarefully-constructed MDP containing only three actions. Our exponential lowerbound hints at the necessity of carefully adjusting update rules or enforcingproper regularization in accelerating PG methods.

Quick Read (beta)

loading the full paper ...