Diffusion Policies for Out-of-Distribution Generalization in Offline Reinforcement Learning

Abstract

Offline Reinforcement Learning (RL) methods leverage previous experiences to learn better policies than the behavior policy used for data collection. However, they face challenges handling distribution shifts due to the lack of online interaction during training. To this end, we propose a novel method named State Reconstruction for Diffusion Policies (SRDP) that incorporates state reconstruction feature learning into the recent class of diffusion policies to address the problem of out-of-distribution (OOD) generalization. Our method promotes learning of generalizable state representations to alleviate the distribution shift caused by OOD states. To illustrate the OOD generalization and faster convergence of SRDP, we design a novel 2D Multimodal Contextual Bandit environment and realize it on a 6-DoF real-world UR10 robot, as well as in simulation, and compare its performance with prior algorithms. In particular, we show the importance of the proposed state reconstruction via ablation studies. In addition, we assess the performance of our model on standard continuous control benchmarks (D4RL), namely the navigation of an 8-DoF ant and forward locomotion of half-cheetah, hopper, and walker2d, achieving state-of-the-art results. Finally, we demonstrate that our method can achieve a 167% improvement over the competing baseline on a sparse continuous control navigation task where various regions of the state space are removed from the offline RL dataset, including the region encapsulating the goal.
