Reasoning with Latent Diffusion in Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL) holds promise as a means to learn high-reward policies from a static dataset, without the need for further environment interactions. However, a key challenge in offline RL lies in effectively stitching portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors arising due to a lack of support in the dataset. Existing approaches use conservative methods that are tricky to tune and struggle with multi-modal data (as we show) or rely on noisy Monte Carlo return-to-go samples for reward conditioning. In this work, we propose a novel approach that leverages the expressiveness of latent diffusion to model in-support trajectory sequences as compressed latent skills. This facilitates learning a Q-function while avoiding extrapolation error via batch-constraining. The latent space is also expressive and gracefully copes with multi-modal data. We show that the learned temporally-abstract latent space encodes richer task-specific information for offline RL tasks as compared to raw state-actions. This improves credit assignment and facilitates faster reward propagation during Q-learning. Our method demonstrates state-of-the-art performance on the D4RL benchmarks, particularly excelling in long-horizon, sparse-reward tasks.
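As a rough illustration of the batch-constraining idea mentioned in the abstract, the sketch below computes a Q-learning target that maximizes only over latent skills drawn from a generative model fitted to the dataset, rather than over arbitrary latents. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `sample_latents`, `q_fn`, `horizon`, and `n_candidates` are hypothetical, and the sampler stands in for whatever latent prior (for example, a latent diffusion model) is used in practice.

```python
import numpy as np


def batch_constrained_target(reward, next_state, sample_latents, q_fn,
                             gamma=0.99, horizon=1, n_candidates=16):
    """Compute a batch-constrained Q-learning target over latent skills.

    Assumptions (all hypothetical, for illustration only):
      - `sample_latents(state, n)` returns an array of shape (n, latent_dim)
        drawn from a model fitted to the offline dataset, so every candidate
        is (approximately) in-support.
      - `q_fn(state, z)` returns a scalar Q-value estimate.
      - `reward` is the return accumulated while executing one latent skill
        for `horizon` environment steps.
    """
    # Restrict the maximization to latents proposed by the data-fitted model.
    candidates = sample_latents(next_state, n_candidates)
    q_values = np.array([q_fn(next_state, z) for z in candidates])

    # A skill spans `horizon` steps, so the bootstrap term is discounted
    # by gamma ** horizon instead of gamma.
    return reward + (gamma ** horizon) * q_values.max()
```

Because the maximum is taken only over sampled candidates, the target never queries the Q-function at latents the generative model would not produce, which is the sense in which extrapolation error is avoided in this sketch.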
