Abstract
In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed-form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and the DDIM-inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility-aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.
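To make the zero-terminal SNR obstacle concrete, the following minimal sketch illustrates only the generic property of such schedules, not the K-order Recursive Noise Representation itself. It assumes a linear beta schedule with illustrative values and the commonly used rescaling that forces the final cumulative alpha to zero; all function names are hypothetical. With SNR equal to zero at the last step, the terminal latent reduces to pure noise, so two different clean latents map to the same terminal state and that state alone cannot identify the input.

```python
import math

def linear_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Cumulative product of (1 - beta_t) for a linear beta schedule (assumed values).
    abar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        abar.append(prod)
    return abar

def rescale_to_zero_terminal_snr(abar):
    # Shift and scale sqrt(abar) so the last entry is exactly 0 and the first is kept.
    s = [math.sqrt(a) for a in abar]
    s0, sT = s[0], s[-1]
    return [((v - sT) * s0 / (s0 - sT)) ** 2 for v in s]

def snr(a):
    # Signal-to-noise ratio abar / (1 - abar) at one timestep.
    return a / (1.0 - a)

def terminal_latent(x0, eps, abar_T):
    # Forward diffusion marginal: x_T = sqrt(abar_T) * x0 + sqrt(1 - abar_T) * eps.
    return [math.sqrt(abar_T) * x + math.sqrt(1.0 - abar_T) * e
            for x, e in zip(x0, eps)]

abar = linear_alphas_cumprod()
abar_zt = rescale_to_zero_terminal_snr(abar)

# Two different clean latents, same noise sample.
eps = [0.3, -1.2, 0.7]
x_a = terminal_latent([1.0, 2.0, 3.0], eps, abar_zt[-1])
x_b = terminal_latent([-5.0, 0.0, 9.0], eps, abar_zt[-1])

# Original terminal SNR is positive; rescaled terminal SNR is zero;
# both terminal latents coincide with eps.
print(snr(abar[-1]) > 0.0, snr(abar_zt[-1]) == 0.0, x_a == x_b)
```

Under these assumptions, recovering the clean latent from the terminal state would require dividing by a zero signal coefficient, which is one way to see why deterministic inversion needs special handling under zero-terminal SNR schedules.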