Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Abstract

Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation required for most contemporary methods. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
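
The claim that controlling each diffusion timestep selects among a policy, a forward dynamics, an inverse dynamics, and a video generator can be illustrated with a small sketch. It is an interpretation of the abstract, not the authors' implementation. It assumes a convention in which timestep 0 means a modality is clean (supplied as conditioning), the maximum timestep means it is pure noise (effectively marginalized out), and an intermediate value is the current denoising step for the modality being sampled. The names `Mode`, `T_MAX`, and `timesteps_for_mode` are hypothetical.

```python
from enum import Enum
from typing import Tuple

T_MAX = 1000  # assumed number of diffusion steps; illustrative value only


class Mode(Enum):
    POLICY = "policy"
    VIDEO_GENERATOR = "video_generator"
    FORWARD_DYNAMICS = "forward_dynamics"
    INVERSE_DYNAMICS = "inverse_dynamics"


def timesteps_for_mode(mode: Mode, t: int, t_max: int = T_MAX) -> Tuple[int, int]:
    """Return (action_timestep, video_timestep) for denoising step t.

    Assumed convention: 0 = modality is clean and acts as conditioning,
    t_max = modality is pure noise and carries no information,
    t = current noise level of the modality being generated.
    """
    if not 0 <= t <= t_max:
        raise ValueError("t must lie in [0, t_max]")
    if mode is Mode.POLICY:
        # Sample actions; future video is fully noised, so it is ignored.
        return t, t_max
    if mode is Mode.VIDEO_GENERATOR:
        # Sample future video; actions are fully noised, so they are ignored.
        return t_max, t
    if mode is Mode.FORWARD_DYNAMICS:
        # Sample future video conditioned on clean actions.
        return 0, t
    if mode is Mode.INVERSE_DYNAMICS:
        # Sample actions conditioned on clean future video.
        return t, 0
    raise ValueError(f"unknown mode: {mode}")
```

Under the same assumed convention, an action-free video clip could be used for training by fixing the action timestep at `T_MAX` and feeding noise in place of the missing actions, which is one way to read the abstract's "independent control of modality-specific diffusion timesteps".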
