Abstract
When designing algorithms for finite-time-horizon episodic reinforcementlearning problems, a common approach is to introduce a fictitious discountfactor and use stationary policies for approximations. Empirically, it has beenshown that the fictitious discount factor helps reduce variance, and stationarypolicies serve to save the per-iteration computational cost. Theoretically,however, there is no existing work on convergence analysis for algorithms withthis fictitious discount recipe. This paper takes the first step towardsanalyzing these algorithms. It focuses on two vanilla policy gradient (VPG)variants: the first being a widely used variant with discounted advantageestimations (DAE), the second with an additional fictitious discount factor inthe score functions of the policy gradient estimators. Non-asymptoticconvergence guarantees are established for both algorithms, and the additionaldiscount factor is shown to reduce the bias introduced in DAE and thus improvethe algorithm convergence asymptotically. A key ingredient of our analysis isto connect three settings of Markov decision processes (MDPs): thefinite-time-horizon, the average reward and the discounted settings. To ourbest knowledge, this is the first theoretical guarantee on fictitious discountalgorithms for the episodic reinforcement learning of finite-time-horizon MDPs,which also leads to the (first) global convergence of policy gradient methodsfor finite-time-horizon episodic reinforcement learning.