A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Abstract

Recently, a new multi-step temporal-difference learning algorithm, called $Q(\sigma)$, was shown to unify $n$-step Tree-Backup (when $\sigma=0$) and $n$-step Sarsa (when $\sigma=1$) by introducing a sampling parameter $\sigma$. However, like other multi-step temporal-difference learning algorithms, $Q(\sigma)$ requires substantial memory and computation time. Eligibility traces are an important mechanism for transforming off-line updates into efficient on-line ones that consume less memory and computation time. In this paper, we further develop the original $Q(\sigma)$, combine it with eligibility traces, and propose a new algorithm, called $Q(\sigma,\lambda)$, in which $\lambda$ is the trace-decay parameter. This idea unifies Sarsa$(\lambda)$ (when $\sigma=1$) and $Q^{\pi}(\lambda)$ (when $\sigma=0$). Furthermore, we give an upper bound on the error of the $Q(\sigma,\lambda)$ policy evaluation algorithm. We prove that the $Q(\sigma,\lambda)$ control algorithm converges to the optimal value function exponentially fast. We also empirically compare it with conventional temporal-difference learning methods. Results show that, with an intermediate value of $\sigma$, $Q(\sigma,\lambda)$ creates a mixture of the existing algorithms that can learn the optimal value significantly faster than either extreme ($\sigma=0$ or $\sigma=1$).
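
To illustrate how the sampling parameter $\sigma$ interpolates between the two extremes, the following is a minimal sketch of the one-step $Q(\sigma)$ temporal-difference error in the tabular case. It is not the paper's $Q(\sigma,\lambda)$ algorithm: eligibility traces and the trace-decay parameter $\lambda$ are omitted, and the function name, argument names, and the nested-list representation of the action values and policy are assumptions made for illustration only.

```python
def q_sigma_td_error(q, pi, s, a, r, s_next, a_next, sigma, gamma):
    """One-step Q(sigma) TD error (illustrative sketch, no eligibility traces).

    q  : hypothetical table of action values, q[state][action]
    pi : hypothetical target policy, pi[state][action] = probability
    """
    # Sarsa-style backup: value of the action actually sampled in s_next.
    sampled = q[s_next][a_next]
    # Tree-Backup-style backup: expectation over all actions under pi.
    expected = sum(p * v for p, v in zip(pi[s_next], q[s_next]))
    # sigma = 1 uses only the sample, sigma = 0 only the expectation,
    # and intermediate values mix the two.
    target = r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
    return target - q[s][a]
```

With $\sigma=1$ the target depends only on the sampled next action, as in Sarsa; with $\sigma=0$ it depends only on the expectation under the target policy, as in Tree-Backup.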
