Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Abstract

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps. To learn the value function for horizon $h$, these algorithms bootstrap from the value function for horizon $h-1$, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as "the deadly triad"). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and $n$-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
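
The bootstrapping scheme described in the abstract can be illustrated with a minimal tabular sketch. It is not the paper's implementation: the function name `fixed_horizon_td_update`, the table layout `V[h, s]`, the step size `alpha`, the optional discount `gamma` (set to 1 here so that values are plain sums of rewards), and the two-state toy environment are all assumptions made for illustration. The horizon-0 row is held at zero, and each horizon $h \geq 1$ is updated toward the reward plus the current horizon $h-1$ estimate at the next state, so no row ever bootstraps from itself.

```python
import numpy as np

def fixed_horizon_td_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Apply one tabular fixed-horizon TD update in place.

    V has shape (H + 1, n_states); V[h, s] estimates the sum of rewards
    over the next h steps from state s. Row 0 is never updated and stays 0.
    """
    H = V.shape[0] - 1
    # Targets for horizons 1..H use the old estimates for horizons 0..H-1,
    # so all horizons can be updated in parallel.
    targets = r + gamma * V[:H, s_next]
    V[1:, s] += alpha * (targets - V[1:, s])
    return V

# Hypothetical toy problem: a deterministic two-state loop.
# Leaving state 0 yields reward 1, leaving state 1 yields reward 0.
V = np.zeros((4, 2))  # horizons 0..3, two states
s = 0
for _ in range(2000):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    fixed_horizon_td_update(V, s, r, s_next, alpha=0.5)
    s = s_next

# V[:, 0] approaches [0, 1, 1, 2] and V[:, 1] approaches [0, 0, 1, 1].
print(V)
```

In this toy loop the rewards are deterministic, so each row settles on the exact $h$-step reward sum. All horizons are updated from a single transition, which is one simple reading of the "parallel updates" mentioned in the abstract.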
