Abstract
We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each. Project Page: https://cut3r.github.io/
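
The online loop described above (update a persistent state with each image, read out a pointmap, accumulate the points in one shared frame) can be illustrated with a toy sketch. This is not the CUT3R architecture: the functions `encode`, `update_state`, `read_pointmap`, and `run_online`, the blending `rate`, and the state dimension are all hypothetical stand-ins chosen only to show the control flow, and the "pointmap" here is a synthetic grid rather than a learned metric-scale prediction.

```python
import numpy as np

def encode(image):
    # Hypothetical stand-in for an image encoder: mean value per channel.
    return image.reshape(-1, image.shape[-1]).mean(axis=0)

def update_state(state, token, rate=0.5):
    # Toy recurrent update: blend the previous state with the new observation.
    return (1.0 - rate) * state + rate * token

def read_pointmap(state, image):
    # Toy readout: one 3D point per pixel; (x, y) come from the pixel grid,
    # z is derived from the current state (a placeholder, not real depth).
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    zs = np.full((h, w), state.sum())
    return np.stack([xs, ys, zs], axis=-1).astype(np.float64)

def run_online(images, state_dim=3):
    # Process images one at a time, carrying the state forward and
    # accumulating every per-frame pointmap into a single point set.
    state = np.zeros(state_dim)
    cloud = []
    for image in images:
        state = update_state(state, encode(image))
        cloud.append(read_pointmap(state, image).reshape(-1, 3))
    return state, np.concatenate(cloud, axis=0)

if __name__ == "__main__":
    frames = [np.ones((2, 3, 3)), np.zeros((2, 3, 3))]
    final_state, points = run_online(frames)
    print(final_state.shape, points.shape)
```

The point of the sketch is only that the state persists across inputs, so each readout depends on everything seen so far, and that the accumulated point set grows as new images arrive.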