Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Abstract

Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes. Project page: https://stereo4d.github.io
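
To make the fusion step concrete, the following is a minimal sketch of how a 2D track, per-frame stereo disparity, and per-frame camera poses could be combined into a world-space 3D trajectory. It is an illustration under stated assumptions, not the system described in the paper: it assumes rectified pinhole frames (the source videos are wide-angle, so this presumes prior rectification), a known stereo baseline, and camera-to-world poses. All function and parameter names (`disparity_to_depth`, `unproject`, `camera_to_world`, `lift_track`, `focal_px`, `baseline_m`) are hypothetical, and the filtering stage mentioned in the abstract is omitted.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Rectified-stereo relation: depth = focal * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def unproject(u, v, depth, focal_px, cx, cy):
    """Back-project pixel (u, v) at a given depth into camera coordinates.

    Assumes a pinhole model with equal focal lengths on both axes.
    """
    x = (u - cx) * depth / focal_px
    y = (v - cy) * depth / focal_px
    return (x, y, depth)


def camera_to_world(point_cam, rotation, translation):
    """Apply a camera-to-world pose: X_world = R * X_cam + t."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def lift_track(track_2d, disparities, poses, focal_px, cx, cy, baseline_m):
    """Lift one 2D track into a world-space 3D trajectory.

    track_2d:    list of (u, v) pixel positions, one per frame
    disparities: list of disparities (pixels) sampled at those positions
    poses:       list of (rotation, translation) camera-to-world poses,
                 rotation as a 3x3 nested list, translation as a 3-tuple
    """
    trajectory = []
    for (u, v), disparity, (rotation, translation) in zip(track_2d, disparities, poses):
        depth = disparity_to_depth(disparity, focal_px, baseline_m)
        point_cam = unproject(u, v, depth, focal_px, cx, cy)
        trajectory.append(camera_to_world(point_cam, rotation, translation))
    return trajectory
```

Because every point is expressed in a single world frame, a static scene point maps to (approximately) the same 3D location in every frame regardless of camera motion, while a moving point traces out its own trajectory.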
