Bellman Unbiasedness: Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

Abstract

Distributional reinforcement learning improves performance by capturingenvironmental stochasticity, but a comprehensive theoretical understanding ofits effectiveness remains elusive. In addition, the intractable element of theinfinite dimensionality of distributions has been overlooked. In this paper, wepresent a regret analysis of distributional reinforcement learning with generalvalue function approximation in a finite episodic Markov decision processsetting. We first introduce a key notion of $\textit{Bellman unbiasedness}$which is essential for exactly learnable and provably efficient distributionalupdates in an online manner. Among all types of statistical functionals forrepresenting infinite-dimensional return distributions, our theoretical resultsdemonstrate that only moment functionals can exactly capture the statisticalinformation. Secondly, we propose a provably efficient algorithm,$\texttt{SF-LSVI}$, that achieves a tight regret bound of $\tilde{O}(d_EH^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number ofepisodes, and $d_E$ is the eluder dimension of a function class.

Quick Read (beta)

loading the full paper ...