Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

Abstract

A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations. In this work, we systematically analyze the performance of different RL and control-based methods under datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties, such as data diversity, trajectory quality, and environment variability, affect the performance of these approaches. Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels in generalization to novel environment layouts, trajectory stitching, and data efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.
