Abstract
Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance by 27.9\% on the LIBERO-LONG benchmark and 20\% on the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.