Abstract
We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans, which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic human priors about articulated body shape (learned from large-scale training data) with video-specific articulated "bag-of-bones" deformation (fit to a single video via test-time optimization). We accomplish this by learning a neural implicit model that disentangles body and clothing deformations into separate motion model layers. To capture the subtle geometry of clothing, we leverage image-based priors such as human body pose, surface normals, and optical flow during optimization. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering. On datasets with highly challenging clothing deformations and object interactions, our method, DressRecon, yields higher-fidelity 3D reconstructions than prior art. Project page: https://jefftan969.github.io/dressrecon/