Abstract
This paper proposes ConsistDreamer, a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency, thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models, our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically, we design surrounding views as context-rich input for the 2D diffusion model, and generate 3D-consistent, structured noise instead of image-independent noise. Moreover, we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions, particularly in complicated large-scale indoor scenes from ScanNet++, with significantly improved sharpness and fine-grained textures. Notably, ConsistDreamer stands as the first work capable of successfully editing complex (e.g., plaid/checkered) patterns. Our project page is at immortalco.github.io/ConsistDreamer.