DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Abstract

The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, have proven challenging to learn and are typically developed for task-specific solutions with online policy learning. We argue that the true potential of world models lies in their ability to reason and plan across diverse problems using only passive data. Concretely, we require world models to have the following three properties: 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To realize this, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This design allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic behavior planning by treating desired goal patch features as prediction targets. We evaluate DINO-WM across various domains, including maze navigation, tabletop pushing, and particle manipulation. Our experiments demonstrate that DINO-WM can generate zero-shot behavioral solutions at test time without relying on expert demonstrations, reward modeling, or pre-learned inverse models. Notably, DINO-WM exhibits strong generalization capabilities compared to prior state-of-the-art work, adapting to diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
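
To make the idea of "action sequence optimization" toward goal features concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy additive latent dynamics function (`toy_dynamics`, hypothetical) in place of the learned predictor, short vectors in place of DINOv2 patch features, and a cross-entropy-method (CEM) optimizer; the abstract does not say which optimizer is used, so CEM is an assumption made purely for illustration.

```python
import random

def toy_dynamics(z, a):
    """Hypothetical stand-in for a learned latent predictor: z_next = z + a."""
    return [zi + ai for zi, ai in zip(z, a)]

def rollout(z0, actions):
    """Apply the dynamics model step by step and return the final latent."""
    z = z0
    for a in actions:
        z = toy_dynamics(z, a)
    return z

def goal_cost(z, z_goal):
    """Squared distance between a predicted latent and the goal latent."""
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def plan_cem(z0, z_goal, horizon=5, samples=200, elites=20, iters=10, seed=0):
    """Assumed optimizer (CEM): sample action sequences, keep the lowest-cost
    ones, refit a per-step Gaussian to them, and repeat."""
    rng = random.Random(seed)
    dim = len(z0)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        candidates = []
        for _ in range(samples):
            seq = [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
                   for t in range(horizon)]
            candidates.append((goal_cost(rollout(z0, seq), z_goal), seq))
        # Sort by cost only, so ties never fall back to comparing sequences.
        candidates.sort(key=lambda c: c[0])
        best = [seq for _, seq in candidates[:elites]]
        for t in range(horizon):
            for d in range(dim):
                vals = [seq[t][d] for seq in best]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = var ** 0.5 + 1e-6  # small floor keeps sampling valid
    return mean  # the refit mean serves as the planned action sequence

z0 = [0.0, 0.0]        # toy "current observation" latent
z_goal = [2.0, -1.0]   # toy "goal observation" latent
plan = plan_cem(z0, z_goal)
initial_cost = goal_cost(z0, z_goal)
final_cost = goal_cost(rollout(z0, plan), z_goal)
print(f"cost before planning: {initial_cost:.3f}, after: {final_cost:.3f}")
```

Nothing in the sketch is trained for a particular task: the only task signal is the goal latent passed to `goal_cost`, which mirrors the abstract's point that goals are specified as target features rather than through rewards or demonstrations.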
