Abstract
Distributional reinforcement learning improves performance by effectivelycapturing environmental stochasticity, but a comprehensive theoreticalunderstanding of its effectiveness remains elusive. In this paper, we present aregret analysis for distributional reinforcement learning with general valuefunction approximation in a finite episodic Markov decision process setting. Wefirst introduce a key notion of Bellman unbiasedness for a tractable andexactly learnable update via statistical functional dynamic programming. Ourtheoretical results show that approximating the infinite-dimensional returndistribution with a finite number of moment functionals is the only method tolearn the statistical information unbiasedly, including nonlinear statisticalfunctionals. Second, we propose a provably efficient algorithm,$\texttt{SF-LSVI}$, achieving a regret bound of $\tilde{O}(d_EH^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number ofepisodes, and $d_E$ is the eluder dimension of a function class.