WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild

Abstract

Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets that suffer from limited diversity, limited camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly fewer data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.
