Abstract
Reinforcement learning (RL) has shown remarkable success in solving complex decision-making and control tasks. However, many model-free RL algorithms experience performance degradation due to inaccurate value estimation, particularly the overestimation of Q-values, which can lead to suboptimal policies. To address this issue, we previously proposed the Distributional Soft Actor-Critic (DSAC or DSACv1), an off-policy RL algorithm that enhances value estimation accuracy by learning a continuous Gaussian value distribution. Despite its effectiveness, DSACv1 faces challenges such as training instability and sensitivity to reward scaling, caused by high variance in critic gradients due to return randomness. In this paper, we introduce three key refinements to DSACv1 to overcome these limitations and further improve Q-value estimation accuracy: expected value substitution, twin value distribution learning, and variance-based critic gradient adjustment. The enhanced algorithm, termed DSAC with Three refinements (DSAC-T or DSACv2), is systematically evaluated across a diverse set of benchmark tasks. Without the need for task-specific hyperparameter tuning, DSAC-T consistently matches or outperforms leading model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T ensures a stable learning process and maintains robust performance across varying reward scales. Its effectiveness is further demonstrated through real-world application in controlling a wheeled robot, highlighting its potential for deployment in practical robotic tasks.