Abstract
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.