Abstract
Stateful policies play an important role in reinforcement learning, for example by handling partially observable environments, enhancing robustness, or imposing an inductive bias directly into the policy structure. The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which comes with significant drawbacks, such as slow training due to sequential gradient propagation and the occurrence of vanishing or exploding gradients. The gradient is often truncated to address these issues, resulting in a biased policy update. We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy, jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we provide a theoretical analysis of our new gradient estimator and compare it with BPTT. We evaluate our approach on complex continuous control tasks, e.g., humanoid locomotion, and demonstrate that our gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.