Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly at this task due to function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness and sometimes result in substantially suboptimal solutions. In this paper, we propose Diffusion-QL, which utilizes a conditional diffusion model as a highly expressive policy class for behavior cloning and policy regularization. In our approach, we learn an action-value function and add a term maximizing action-values to the training loss of a conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show that the expressiveness of the diffusion model-based policy and the coupling of behavior cloning and policy improvement under the diffusion model both contribute to the outstanding performance of Diffusion-QL. We illustrate our method and prior work in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks for offline RL.
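
The combined loss described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a DDPM-style noise-prediction objective for the behavior-cloning term, which the abstract does not specify. The function and argument names (`diffusion_ql_loss`, `noise_pred_fn`, `q_fn`, `sample_action_fn`, `alpha_bars`, `eta`) are hypothetical.

```python
import numpy as np

def diffusion_ql_loss(noise_pred_fn, q_fn, sample_action_fn,
                      states, actions, alpha_bars, eta, rng):
    """Sketch: denoising (behavior-cloning) loss minus a weighted Q-value term.

    All callables are hypothetical stand-ins:
      noise_pred_fn(noisy_actions, states, t) -> predicted noise
      q_fn(states, actions)                   -> action-values, shape (batch,)
      sample_action_fn(states)                -> actions drawn from the policy
    """
    batch = states.shape[0]
    # Pick a random diffusion step per sample and corrupt the dataset action.
    t = rng.integers(0, len(alpha_bars), size=batch)
    a_bar = alpha_bars[t][:, None]
    eps = rng.standard_normal(actions.shape)
    noisy_actions = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * eps
    # Behavior-cloning term: predict the injected noise, conditioned on state.
    bc_loss = np.mean((noise_pred_fn(noisy_actions, states, t) - eps) ** 2)
    # Policy-improvement term: mean Q-value of actions sampled from the policy.
    q_term = np.mean(q_fn(states, sample_action_fn(states)))
    # Minimizing this keeps actions near the data while pushing Q-values up.
    return bc_loss - eta * q_term
```

With `eta = 0` the sketch reduces to pure behavior cloning. Larger values of `eta` (an assumed trade-off weight) put more emphasis on high action-values relative to staying close to the behavior policy.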
