Learning Pessimism for Robust and Efficient Off-Policy Reinforcement Learning

Abstract

Popular off-policy deep reinforcement learning algorithms compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns. In this work, we propose a novel learnable penalty to enact such pessimism, based on a new way to quantify the critic's epistemic uncertainty. Furthermore, we propose to learn the penalty alongside the critic with dual TD-learning, a strategy to estimate and minimize the bias magnitude in the target returns. Our method enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. Empirically, by integrating our method and other orthogonal improvements with popular off-policy algorithms, we achieve state-of-the-art results in continuous control tasks from both proprioceptive and pixel observations.
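To make the idea of a pessimistic target concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes two critics and uses their absolute disagreement as a stand-in for epistemic uncertainty; the function name `pessimistic_td_target` and the parameters `beta` and `gamma` are hypothetical, chosen only for illustration. How `beta` would be learned with dual TD-learning is not shown, since the abstract does not specify it.

```python
def pessimistic_td_target(reward, done, q1_next, q2_next, beta, gamma=0.99):
    """Return a TD target penalized by critic disagreement.

    Assumption (not from the paper): uncertainty is approximated by
    |q1_next - q2_next|, and beta scales how pessimistic the target is.
    """
    q_mean = 0.5 * (q1_next + q2_next)        # average next-state value
    disagreement = abs(q1_next - q2_next)     # illustrative uncertainty proxy
    penalized = q_mean - beta * disagreement  # larger beta -> more pessimism
    # No bootstrapping on terminal transitions (done == 1.0).
    return reward + gamma * (1.0 - done) * penalized


# beta = 0.5 recovers the minimum of the two critics.
target = pessimistic_td_target(1.0, 0.0, 2.0, 4.0, beta=0.5, gamma=0.9)
```

In this sketch a fixed `beta = 0.5` reduces to taking the minimum of the two critics, because `(a + b) / 2 - |a - b| / 2 = min(a, b)`, while `beta = 0` gives the unpenalized mean. A learnable penalty, as the abstract describes, would instead adjust the degree of pessimism during training rather than fixing it in advance.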
