Abstract
Distributional approaches to value-based reinforcement learning model the entire distribution of returns, rather than just their expected values, and have recently been shown to yield state-of-the-art empirical performance. This was demonstrated by the recently proposed C51 algorithm, based on categorical distributional reinforcement learning (CDRL) [Bellemare et al., 2017]. However, the theoretical properties of CDRL algorithms are not yet well understood. In this paper, we introduce a framework to analyse CDRL algorithms, establish the importance of the projected distributional Bellman operator in distributional RL, draw fundamental connections between CDRL and the Cramér distance, and give a proof of convergence for sample-based categorical distributional reinforcement learning algorithms.