Moderate Actor-Critic Methods: Controlling Overestimation Bias via Expectile Loss

Abstract

Overestimation is a fundamental characteristic of model-free reinforcement learning (MF-RL), arising from the principles of temporal difference learning and the approximation of the Q-function. To address this challenge, we propose a novel moderate target in the Q-function update, formulated as a convex optimization of an overestimated Q-function and its lower bound. Our primary contribution lies in the efficient estimation of this lower bound through the lower expectile of the Q-value distribution conditioned on a state. Notably, our moderate target integrates seamlessly into state-of-the-art (SOTA) MF-RL algorithms, including Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC). Experimental results validate the effectiveness of our moderate target in mitigating overestimation bias in DDPG, SAC, and distributional RL algorithms.
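
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard asymmetric squared (expectile) loss, a simple way to estimate a lower expectile from sampled Q-values, and one possible reading of the moderate target as a weighted blend of the bootstrapped Q-value and its lower bound. The function names, the mixing weight `beta`, and the exact form of the blend are assumptions made for illustration; the abstract does not specify them.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    diff is (sample - estimate). With tau < 0.5, samples below the
    estimate are penalized more, which pulls the estimate downward.
    """
    weight = (1.0 - tau) if diff < 0 else tau
    return weight * diff * diff


def sample_expectile(values, tau, iters=100):
    """Estimate the tau-expectile of a sample by iteratively reweighted means."""
    estimate = sum(values) / len(values)  # tau = 0.5 gives the plain mean
    for _ in range(iters):
        # Samples above the estimate get weight tau, the rest 1 - tau.
        weights = [tau if v > estimate else 1.0 - tau for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


def moderate_target(reward, gamma, q_next, lower_bound, beta):
    """Hypothetical moderate target: blend of Q and its lower bound.

    beta in [0, 1] is an assumed mixing weight; beta = 1 recovers the
    usual bootstrapped target, beta = 0 uses only the lower bound.
    """
    mixed = beta * q_next + (1.0 - beta) * lower_bound
    return reward + gamma * mixed


if __name__ == "__main__":
    q_samples = [1.0, 2.0, 3.0]  # Q-values for sampled actions in one state
    lower = sample_expectile(q_samples, tau=0.1)
    print(lower, moderate_target(1.0, 0.99, max(q_samples), lower, beta=0.5))
```

In an actor-critic setting the lower expectile would typically be learned by a separate network trained with `expectile_loss` rather than computed from a finite sample; the sample-based estimator here only illustrates what quantity such a network would approximate.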
