Diffusion Policies for Out-of-Distribution Generalization in Offline Reinforcement Learning

Abstract

Offline Reinforcement Learning (RL) methods leverage previous experiences to learn better policies than the behavior policy used for data collection. In contrast to behavior cloning, which assumes the data is collected from expert demonstrations, offline RL can work with non-expert data and multimodal behavior policies. However, offline RL algorithms face challenges in handling distribution shifts and effectively representing policies due to the lack of online interaction during training. Prior work on offline RL uses conditional diffusion models to represent multimodal behavior in the dataset. Nevertheless, these methods are not tailored toward generalization to out-of-distribution states. We introduce a novel method named State Reconstruction for Diffusion Policies (SRDP), which incorporates state reconstruction feature learning into the recent class of diffusion policies to address the out-of-distribution generalization problem. The state reconstruction loss promotes generalizable representation learning of states to alleviate the distribution shift incurred by out-of-distribution (OOD) states. We design a novel 2D Multimodal Contextual Bandit environment to illustrate the OOD generalization and faster convergence of SRDP compared to prior algorithms. In addition, we assess the performance of our model on D4RL continuous control benchmarks, namely the navigation of an 8-DoF ant and forward locomotion of half-cheetah, hopper, and walker2d, achieving state-of-the-art results.
