Abstract
Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the whole distribution of the total return rather than only its expectation. Despite the remarkable performance of distributional RL, a theoretical understanding of its advantages over expectation-based RL remains elusive. In this paper, we attribute the superiority of distributional RL to its regularization effect, which arises from the value distribution information regardless of its expectation. First, by leveraging a variant of the gross error model in robust statistics, we decompose the value distribution into its expectation and the remaining distribution part. As such, the extra benefit of distributional RL compared with expectation-based RL is mainly interpreted as the impact of a \textit{risk-sensitive entropy regularization} within the Neural Fitted Z-Iteration framework. We also establish a bridge between the risk-sensitive entropy regularization of distributional RL and the vanilla entropy in maximum entropy RL, focusing specifically on actor-critic algorithms. This connection reveals that distributional RL induces a corrected reward function and thus promotes risk-sensitive exploration against the intrinsic uncertainty of the environment. Finally, extensive experiments corroborate the role of the regularization effect of distributional RL and uncover the mutual impacts of different entropy regularizations. Our research paves the way towards better interpreting the efficacy of distributional RL algorithms, especially through the lens of regularization.