Abstract
We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning (RL) has been shown to outperform its conventional expected-value counterpart in a recent study~\citep{bellemare2017distributional}. In the policy evaluation setting, we design two new algorithms, distributional GTD2 and distributional TDC, by applying the Cram{\'e}r distance to the distributional version of the Bellman error objective function; they inherit the advantages of both the nonlinear gradient TD algorithms and the distributional RL approach. In the control setting, we propose distributional Greedy-GQ using a similar derivation. We prove the asymptotic almost-sure convergence of distributional GTD2 and TDC to a locally optimal solution for general smooth function approximators, which include the neural networks widely used in recent studies to solve real-life RL problems. The per-step computational complexity of each of the three algorithms is linear in the number of parameters of the function approximator, so they can be implemented efficiently for neural networks.
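
For reference, the Cram{\'e}r distance named above can be written as follows. This is a sketch that assumes the standard definition via cumulative distribution functions; the symbols $F_P$ and $F_Q$ (the CDFs of two return distributions $P$ and $Q$) are notation introduced here for illustration and may differ from the notation used in the body of the paper, which may also work with the squared form.
\[
\ell_2(P, Q) = \left( \int_{-\infty}^{\infty} \bigl( F_P(x) - F_Q(x) \bigr)^2 \, dx \right)^{1/2}
\]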