Soft Diffusion Actor-Critic: Efficient Online Reinforcement Learning for Diffusion Policy

Abstract

Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the vanilla diffusion training procedure requires samples from the target distribution, which is impossible in online RL since we cannot sample from the optimal policy, making training diffusion policies highly non-trivial in online RL. Backpropagating the policy gradient through the diffusion process incurs huge computational costs and instability, making it expensive and impractical. To enable efficient diffusion policy training for online RL, we propose Soft Diffusion Actor-Critic (SDAC), exploiting the viewpoint of diffusion models as noise-perturbed energy-based models. The proposed SDAC relies solely on the state-action value function as the energy function to train diffusion policies, bypassing sampling from the optimal policy while keeping computation lightweight. We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that SDAC outperforms all recent diffusion-policy online RL methods on most tasks, and improves by more than 120% over soft actor-critic on complex locomotion tasks such as Humanoid and Ant.
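
The "noise-perturbed energy-based model" viewpoint mentioned above can be illustrated with a short sketch. This is not taken from the paper body, which is not shown here; it is the standard maximum-entropy and diffusion formulation, written under stated assumptions. The temperature \lambda, the noise schedule \bar{\alpha}_t, and the score network s_\theta are notation introduced only for this illustration, and the exact SDAC objective may differ.

```latex
% Assumption: the target policy is the energy-based (Boltzmann) policy
% whose negative energy is the state-action value function Q.
\pi^{*}(a \mid s) \;\propto\; \exp\!\big( Q(s, a) / \lambda \big)

% Assumption: a standard Gaussian forward (noising) process on actions.
q(a_t \mid a_0) \;=\; \mathcal{N}\!\big( a_t;\ \sqrt{\bar{\alpha}_t}\, a_0,\ (1 - \bar{\alpha}_t)\, I \big)

% The noise-perturbed policy at diffusion step t is the target policy
% convolved with the forward kernel.
p_t(a_t \mid s) \;=\; \int q(a_t \mid a_0)\, \pi^{*}(a_0 \mid s)\, \mathrm{d}a_0

% A diffusion policy approximates the score of this perturbed distribution.
s_\theta(a_t, s, t) \;\approx\; \nabla_{a_t} \log p_t(a_t \mid s)
```

Under this view, the difficulty noted in the abstract is that the usual denoising objective needs samples a_0 from \pi^{*}, whereas only Q is available in online RL.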
