Multi-step Reinforcement Learning: A Unifying Algorithm

Abstract

Unifying seemingly disparate algorithmic ideas to produce better performingalgorithms has been a longstanding goal in reinforcement learning. As a primaryexample, TD($\lambda$) elegantly unifies one-step TD prediction with MonteCarlo methods through the use of eligibility traces and the trace-decayparameter $\lambda$. Currently, there are a multitude of algorithms that can beused to perform TD control, including Sarsa, $Q$-learning, and Expected Sarsa.These methods are often studied in the one-step case, but they can be extendedacross multiple time steps to achieve better performance. Each of thesealgorithms is seemingly distinct, and no one dominates the others for allproblems. In this paper, we study a new multi-step action-value algorithmcalled $Q(\sigma)$ which unifies and generalizes these existing algorithms,while subsuming them as special cases. A new parameter, $\sigma$, is introducedto allow the degree of sampling performed by the algorithm at each step duringits backup to be continuously varied, with Sarsa existing at one extreme (fullsampling), and Expected Sarsa existing at the other (pure expectation).$Q(\sigma)$ is generally applicable to both on- and off-policy learning, but inthis work we focus on experiments in the on-policy case. Our results show thatan intermediate value of $\sigma$, which results in a mixture of the existingalgorithms, performs better than either extreme. The mixture can also be varieddynamically which can result in even greater performance.

Quick Read (beta)

loading the full paper ...