VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Abstract

Recovering 3D scenes from sparse views is challenging because the problem is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, their performance still degrades when the input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they can generate video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering studies have begun to explore the video generative prior and create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference and the lack of 3D constraints, leading to inefficiency and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene, which distills the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information, and we train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that VideoScene achieves faster and superior 3D scene generation results compared with previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene
