Abstract
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.