The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

Abstract

While distributional reinforcement learning (RL) has demonstrated empiricalsuccess, the question of when and why it is beneficial has remained unanswered.In this work, we provide one explanation for the benefits of distributional RLthrough the lens of small-loss bounds, which scale with the instance-dependentoptimal cost. If the optimal cost is small, our bounds are stronger than thosefrom non-distributional approaches. As warmup, we show that learning the costdistribution leads to small-loss regret bounds in contextual bandits (CB), andwe find that distributional CB empirically outperforms the state-of-the-art onthree challenging tasks. For online RL, we propose a distributionalversion-space algorithm that constructs confidence sets using maximumlikelihood estimation, and we prove that it achieves small-loss regret in thetabular MDPs and enjoys small-loss PAC bounds in latent variable models.Building on similar insights, we propose a distributional offline RL algorithmbased on the pessimism principle and prove that it enjoys small-loss PACbounds, which exhibit a novel robustness property. For both online and offlineRL, our results provide the first theoretical benefits of learningdistributions even when we only need the mean for making decisions.

Quick Read (beta)

loading the full paper ...