Abstract
Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes. Project page: https://stereo4d.github.io