Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.