Wonderland: Navigating 3D Scenes from a Single Image

Abstract

This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
