Learning from Random Demonstrations: Offline Reinforcement Learning with Importance-Sampled Diffusion Models

Abstract

Generative models such as diffusion models have been employed as world models in offline reinforcement learning to generate synthetic data for more effective learning. Existing work either builds the diffusion model once, prior to training, or requires additional interaction data to update it. In this paper, we propose a novel approach for offline reinforcement learning with closed-loop policy evaluation and world-model adaptation. It iteratively leverages a guided diffusion world model to directly evaluate the offline target policy with actions drawn from it, and then performs an importance-sampled world model update to adaptively align the world model with the updated policy. We analyze the performance of the proposed method and provide an upper bound on the return gap between our method and the real environment under an optimal policy. The result sheds light on various factors affecting learning performance. Evaluations in the D4RL environment show significant improvement over state-of-the-art baselines, especially when only random or medium-expertise demonstrations are available, a setting that requires improved alignment between the world model and offline policy evaluation.
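To make the idea of an importance-sampled world model update more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each offline transition is reweighted by the ratio of the target policy's action probability to the behavior policy's action probability, and that the weighted per-sample loss is then used to update the world model. The function names (`importance_weights`, `weighted_model_loss`), the clipping threshold, and the self-normalization step are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def importance_weights(logp_target, logp_behavior, clip=10.0):
    """Clipped, self-normalized ratios pi_target(a|s) / pi_behavior(a|s).

    Clipping and self-normalization are illustrative assumptions,
    not details taken from the paper.
    """
    log_ratio = np.asarray(logp_target, dtype=float) - np.asarray(logp_behavior, dtype=float)
    ratios = np.minimum(np.exp(log_ratio), clip)  # cap large ratios to limit variance
    return ratios / ratios.sum()                  # weights sum to 1


def weighted_model_loss(per_sample_loss, weights):
    """Importance-weighted average of per-transition world-model losses."""
    return float(np.sum(weights * np.asarray(per_sample_loss, dtype=float)))


# Usage: transitions whose actions the target policy prefers get more weight.
logp_target = [0.0, -1.0]    # hypothetical log pi_target(a|s)
logp_behavior = [-1.0, 0.0]  # hypothetical log pi_behavior(a|s)
w = importance_weights(logp_target, logp_behavior)
loss = weighted_model_loss([0.5, 2.0], w)
```

When the target and behavior policies agree on every sample, the weights are uniform and the weighted loss reduces to the ordinary mean loss, so in this sketch the reweighting only changes the update once the policy has moved away from the data-collecting policy.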
