Consolidating Attention Features for Multi-view Image Editing

Abstract

Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs that have fewer visual artifacts and are better aligned with the target geometry.
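
To make the second insight concrete, the following toy sketch shows how the queries alone determine a self-attention map when the keys are held fixed, and how moving a view's queries toward a shared set of queries moves its attention map toward the shared one. The abstract does not specify how the soft injection is carried out, so the linear blend used here (`blend_queries`, with weight `alpha`) is purely an illustrative assumption and not the authors' mechanism; all names and the tiny 2-dimensional features are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    # Standard scaled dot-product attention weights: softmax(Q K^T / sqrt(d)).
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

def blend_queries(own, rendered, alpha):
    # ASSUMPTION: linear interpolation stands in for "soft injection".
    # alpha = 0 keeps the view's own queries, alpha = 1 uses the rendered ones.
    return [
        [(1 - alpha) * o + alpha * r for o, r in zip(q_own, q_ren)]
        for q_own, q_ren in zip(own, rendered)
    ]

def map_distance(a, b):
    # Sum of absolute differences between two attention maps.
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy features: three key tokens, two query tokens, feature dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
own_q = [[2.0, 0.0], [0.0, 2.0]]        # queries produced by one edited view
rendered_q = [[0.0, 2.0], [2.0, 0.0]]   # stand-in for 3D-consistent queries

target = attention_map(rendered_q, keys)
distances = {}
for alpha in (0.0, 0.5, 1.0):
    blended = blend_queries(own_q, rendered_q, alpha)
    distances[alpha] = map_distance(attention_map(blended, keys), target)
    print(alpha, round(distances[alpha], 4))
```

In this toy setting the distance to the target attention map shrinks as `alpha` grows, which is the intuition behind enforcing query consistency across views; it says nothing about how QNeRF itself is trained or queried.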
