Abstract
Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) a time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.