Abstract
Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs that have fewer visual artifacts and are better aligned with the target geometry.
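
To make the idea of soft query injection concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes that "soft" injection can be modeled as a linear interpolation between the queries the network produces for a view and the 3D-consistent queries rendered for the same view; the abstract does not specify the blending rule. The names `blend_queries`, `self_attention`, and the weight `alpha` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def blend_queries(q_generated, q_rendered, alpha):
    """Interpolate between generated and rendered queries.

    ASSUMPTION: linear blending stands in for "soft injection".
    alpha = 0.0 keeps the generated queries unchanged;
    alpha = 1.0 replaces them with the rendered queries.
    """
    return [
        [(1.0 - alpha) * g + alpha * r for g, r in zip(g_row, r_row)]
        for g_row, r_row in zip(q_generated, q_rendered)
    ]


def self_attention(queries, keys, values):
    """Standard scaled dot-product attention on lists of token vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return out


# Toy example: two tokens with two-dimensional features.
q_generated = [[1.0, 0.0], [0.0, 1.0]]  # queries from the current view
q_rendered = [[0.0, 1.0], [1.0, 0.0]]   # stand-in for rendered, consistent queries
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 3.0], [4.0, 5.0]]

q_soft = blend_queries(q_generated, q_rendered, alpha=0.5)
attended = self_attention(q_soft, keys, values)
```

In this sketch only the queries are altered, while the keys and values still come from the view being generated. Under the stated assumption, `alpha` controls how strongly the rendered queries steer the image structure.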