Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion

Abstract

Distributional reinforcement learning algorithms have attempted to utilize estimated uncertainty for exploration, such as optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may cause biased data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning algorithm that selects actions by randomizing the risk criterion to avoid a one-sided tendency on risk. We provide a perturbed distributional Bellman optimality operator by distorting the risk measure and prove the convergence and optimality of the proposed method under a weaker contraction property. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments, including 55 Atari games.
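To make the idea of a randomized risk criterion concrete, the following is a minimal sketch and not the paper's operator: the abstract does not specify the distortion, so a CVaR-style tail mean with a randomly drawn level and direction is swapped in as a stand-in. The names `tail_mean` and `select_action`, the quantile-list representation of each action's return distribution, and the sampling range for `alpha` are all hypothetical.

```python
import random


def tail_mean(quantiles, alpha, lower):
    """Mean of the lowest (lower=True) or highest (lower=False)
    alpha-fraction of a list of quantile estimates.

    This CVaR-style tail mean is an assumed stand-in for a distorted
    risk measure; it is not taken from the paper.
    """
    ordered = sorted(quantiles, reverse=not lower)
    k = max(1, round(alpha * len(ordered)))  # use at least one quantile
    return sum(ordered[:k]) / k


def select_action(quantiles_per_action, rng):
    """Pick the greedy action under a freshly sampled risk criterion.

    Each call draws a risk level and a direction (risk-averse or
    risk-seeking), so no single attitude toward risk is applied at
    every step.
    """
    alpha = rng.uniform(0.05, 1.0)  # hypothetical sampling range
    lower = rng.random() < 0.5      # True: risk-averse, False: risk-seeking
    scores = [tail_mean(q, alpha, lower) for q in quantiles_per_action]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two actions with the same mean return but different spread.
    quantiles = [[4.0, 5.0, 6.0], [0.0, 5.0, 10.0]]
    picks = [select_action(quantiles, rng) for _ in range(1000)]
    print("fraction choosing the high-variance action:", sum(picks) / len(picks))
```

In this sketch, an always-optimistic rule would pick the high-variance action at every step when the means are tied, whereas the randomized criterion splits its choices between the two actions. When one action's quantiles dominate another's, every sampled criterion agrees on it.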
