Abstract
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches are limited in that they either depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. In this work, we introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion, from casually captured monocular videos. We tackle the under-constrained nature of the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we utilize a comprehensive set of data-driven priors, including monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/
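
To illustrate the first insight, the following is a minimal sketch of expressing each point's motion as a weighted combination of a few shared rigid motion bases. It is not the paper's implementation: the function names (`rotation_z`, `blend_motion`), the array shapes, and the choice to blend the rigidly transformed points (in the style of linear blend skinning) rather than the transform parameters are all assumptions made for illustration.

```python
import numpy as np


def rotation_z(theta):
    """Rotation matrix about the z-axis (hypothetical helper for the demo)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def blend_motion(points, weights, rotations, translations):
    """Move points using a per-point weighted mix of rigid motion bases.

    points:       (N, 3) canonical 3D positions
    weights:      (N, B) per-point basis coefficients (assumed to sum to 1)
    rotations:    (B, 3, 3) rotation of each basis at the target time
    translations: (B, 3) translation of each basis at the target time
    """
    # Apply every basis transform to every point: result has shape (B, N, 3).
    per_basis = np.einsum("bij,nj->bni", rotations, points)
    per_basis = per_basis + translations[:, None, :]
    # Combine the B candidate positions of each point with its own weights.
    return np.einsum("nb,bni->ni", weights, per_basis)


# Two bases: a pure translation along z, and a 90-degree rotation about z.
rotations = np.stack([np.eye(3), rotation_z(np.pi / 2)])
translations = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])

points = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# One-hot weights assign each point to a single rigid group; fractional
# weights would give the soft decomposition mentioned in the abstract.
weights = np.array([[1.0, 0.0], [0.0, 1.0]])

moved = blend_motion(points, weights, rotations, translations)
```

With one-hot weights each point follows exactly one basis, so the scene splits into rigidly moving groups; intermediate weights let a point interpolate between groups. Because only the small set of bases varies over time, the number of motion parameters stays low even for long sequences.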