Abstract
Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions, providing a practical solution where online data collection is expensive or risky. However, offline RL often suffers from distribution shift, resulting in inaccurate evaluation and substantial overestimation on out-of-distribution (OOD) actions. To address this, existing approaches incorporate conservatism by indiscriminately discouraging all OOD actions, thereby hindering the agent's ability to generalize and exploit beneficial ones. In this paper, we propose Advantage-based Diffusion Actor-Critic (ADAC), a novel method that systematically evaluates OOD actions using the batch-optimal value function. Based on this evaluation, ADAC defines an advantage function to modulate the Q-function update, enabling more precise assessment of OOD action quality. We design a custom PointMaze environment and collect datasets to visually reveal that advantage modulation can effectively identify and select superior OOD actions. Extensive experiments show that ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark, with particularly clear margins on the more challenging tasks.