Abstract
We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit \href{https://pointscoder.github.io/PhysWorld_Web/}{the project webpage} for details.