Prompt-to-Prompt Image Editing with Cross Attention Control

Abstract

Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describing their intent. Therefore, it is only natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
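
To make the role of cross-attention concrete, the following is a minimal sketch, not the paper's implementation. It assumes the standard scaled dot-product form, in which each spatial position of the image features attends over the prompt tokens, and it illustrates one way the influence of a single word could be amplified by scaling that word's attention column. The function names (`cross_attention_maps`, `reweight_token`), the toy dimensions, and the random features are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_maps(pixel_queries, token_keys):
    """Return one attention row per spatial position.

    Each row is a distribution over prompt tokens, so column j of the
    result can be read as a spatial map for token j.
    """
    d = len(token_keys[0])
    maps = []
    for q in pixel_queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in token_keys
        ]
        maps.append(softmax(scores))
    return maps


def reweight_token(maps, token_index, scale):
    """Scale the attention column of one token (assumed re-weighting rule)."""
    return [
        [w * scale if j == token_index else w for j, w in enumerate(row)]
        for row in maps
    ]


# Toy setup (hypothetical sizes): 16 spatial positions, 5 prompt tokens.
random.seed(0)
dim, n_pixels, n_tokens = 8, 16, 5
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_pixels)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

maps = cross_attention_maps(queries, keys)
edited = reweight_token(maps, token_index=2, scale=2.0)
```

In this sketch, `maps` ties every spatial position to every word, which is the relation the abstract describes. Under the same assumptions, reusing the maps of an original prompt while generating from an edited prompt would be one way to keep the layout fixed when a word is replaced.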
