Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL) aims to learn optimal policies from previously collected datasets. Recently, due to their powerful representational capabilities, diffusion models have shown significant potential as policy models for offline RL problems. However, previous offline RL algorithms based on diffusion policies generally adopt weighted regression to improve the policy. This approach optimizes the policy using only the collected actions and is sensitive to Q-values, which limits the potential for further performance enhancement. To this end, we propose a novel preferred-action-optimized diffusion policy for offline RL. In particular, an expressive conditional diffusion model is utilized to represent the diverse distribution of a behavior policy. Meanwhile, based on the diffusion model, preferred actions within the same behavior distribution are automatically generated through the critic function. Moreover, an anti-noise preference optimization is designed to achieve policy improvement by using the preferred actions, which can adapt to noisy preferred actions for stable training. Extensive experiments demonstrate that the proposed method provides competitive or superior performance compared to previous state-of-the-art offline RL methods, particularly in sparse-reward tasks such as Kitchen and AntMaze. Additionally, we empirically demonstrate the effectiveness of anti-noise preference optimization.
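
As a rough illustration of how preferred actions might be generated through a critic, the sketch below draws several candidate actions for a state from a behavior-policy sampler and ranks them by Q-value. It is not the authors' implementation. The names `select_preferred_actions`, `sample_actions`, `q_value`, and `num_candidates` are hypothetical, the rule of keeping the highest- and lowest-scoring candidates as a preference pair is an assumption, and a toy Gaussian sampler and quadratic critic stand in for the diffusion model and the learned critic.

```python
import numpy as np


def select_preferred_actions(state, sample_actions, q_value, num_candidates=8, rng=None):
    """Return a (preferred, dispreferred) action pair for one state.

    Assumption: candidates come from a behavior-policy sampler and are
    ranked by a critic; the abstract does not specify the exact rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Candidate actions, shape (num_candidates, action_dim).
    candidates = sample_actions(state, num_candidates, rng)
    # Score each candidate with the critic.
    scores = np.array([q_value(state, action) for action in candidates])
    preferred = candidates[int(np.argmax(scores))]
    dispreferred = candidates[int(np.argmin(scores))]
    return preferred, dispreferred


def toy_sampler(state, n, rng):
    # Hypothetical stand-in for a conditional diffusion sampler:
    # Gaussian noise around the state vector.
    return state + rng.normal(scale=0.5, size=(n, state.shape[0]))


def toy_critic(state, action):
    # Hypothetical stand-in for a learned Q-function:
    # actions closer to the all-ones vector score higher.
    return -float(np.sum((action - 1.0) ** 2))


state = np.zeros(2)
preferred, dispreferred = select_preferred_actions(
    state, toy_sampler, toy_critic, rng=np.random.default_rng(0)
)
```

Because both actions in the pair are drawn from the same sampler, they stay within the behavior distribution. A noisy critic can still rank them incorrectly, which is the situation the anti-noise preference optimization is described as addressing.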
