Abstract
The softmax policy gradient (PG) method, which performs gradient ascent undersoftmax policy parameterization, is arguably one of the de factoimplementations of policy optimization in modern reinforcement learning. For$\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs),remarkable progress has recently been achieved towards establishing globalconvergence of softmax PG methods in finding a near-optimal policy. However,prior results fall short of delineating clear dependencies of convergence rateson salient parameters such as the cardinality of the state space $\mathcal{S}$and the effective horizon $\frac{1}{1-\gamma}$, both of which could beexcessively large. In this paper, we deliver a pessimistic message regardingthe iteration complexity of softmax PG methods, despite assuming access toexact gradient computation. Specifically, we demonstrate that softmax PGmethods can take exponential time -- in terms of $|\mathcal{S}|$ and$\frac{1}{1-\gamma}$ -- to converge, even in the presence of a benign policyinitialization and an initial state distribution amenable to exploration. Thisis accomplished by characterizing the algorithmic dynamics over acarefully-constructed MDP containing only three actions. Our exponential lowerbound hints at the necessity of carefully adjusting update rules or enforcingproper regularization in accelerating PG methods.