Abstract
Overestimation is a fundamental characteristic of model-free reinforcement learning (MF-RL), arising from the principles of temporal difference learning and the approximation of the Q-function. To address this challenge, we propose a novel moderate target in the Q-function update, formulated as a convex optimization of an overestimated Q-function and its lower bound. Our primary contribution lies in the efficient estimation of this lower bound through the lower expectile of the Q-value distribution conditioned on a state. Notably, our moderate target integrates seamlessly into state-of-the-art (SOTA) MF-RL algorithms, including Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC). Experimental results validate the effectiveness of our moderate target in mitigating overestimation bias in DDPG, SAC, and distributional RL algorithms.