Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning

Abstract

Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning typically are prone to an over- or underestimation bias building up over time. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC) that uses the most recent high variance but unbiased on-policy rollouts to alleviate the bias of the low variance temporal difference targets. We apply ACC to Truncated Quantile Critics, which is an algorithm for continuous control that allows regulation of the bias with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts the parameter during training rendering hyperparameter search unnecessary and sets a new state of the art on the OpenAI gym continuous control benchmark among all algorithms that do not tune hyperparameters for each environment. Additionally, we demonstrate that ACC is quite general by further applying it to TD3 and showing an improved performance also in this setting.
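
The calibration idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the step size, the clipping range, the normalization by the mean absolute gap, and the convention that a larger `beta` makes the temporal difference targets more optimistic are all assumptions made for illustration. The sketch compares the critic's estimates with the discounted returns of a recent on-policy rollout and moves the bias-control parameter in the direction that would shrink the gap.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo discounted returns for one finished on-policy episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def update_bias_parameter(beta, critic_values, returns, step_size=0.1,
                          beta_min=0.0, beta_max=1.0, eps=1e-8):
    """Nudge a bias-control parameter toward agreement with observed returns.

    Hypothetical convention: a larger beta yields more optimistic targets.
    The normalization below is an illustrative choice, not taken from the text.
    """
    gaps = [g - q for g, q in zip(returns, critic_values)]
    # Positive mean gap: the critic underestimates, so beta should grow.
    mean_gap = sum(gaps) / len(gaps)
    # Scale by the mean absolute gap so the step does not depend on reward units.
    scale = sum(abs(d) for d in gaps) / len(gaps) + eps
    beta = beta + step_size * mean_gap / scale
    # Keep beta inside its allowed range.
    return min(beta_max, max(beta_min, beta))


if __name__ == "__main__":
    rollout_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
    # The critic overestimates every state here, so beta is decreased.
    print(update_bias_parameter(0.5, [3.0, 3.0, 3.0], rollout_returns))
```

In an algorithm with a per-environment bias hyperparameter, such as the one Truncated Quantile Critics exposes, a parameter like `beta` would be mapped onto that hyperparameter; how that mapping is done is not specified in the abstract.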
