Abstract
In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.