Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Abstract

Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation: given a guidance image and a target text prompt, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text, while preserving the semantic layout of the source image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the target image, requiring no training or fine-tuning and applicable to both real and generated guidance images. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color.
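
As a rough illustration of the injection idea described in the abstract, the toy sketch below computes a self-attention map from guidance tokens and reuses it when processing target tokens. It is not the paper's implementation: the single NumPy attention layer, the random weights, and the names `attention_map`, `self_attention`, and `injected_map` are all assumptions made for illustration, and a real diffusion model would apply this inside its denoising network at selected layers and timesteps.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(tokens, w_q, w_k):
    # Row-stochastic self-attention map over spatial tokens.
    q = tokens @ w_q
    k = tokens @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def self_attention(tokens, w_q, w_k, w_v, injected_map=None):
    # Hypothetical layer: if a map is injected, it overrides the map
    # that would otherwise be computed from the layer's own tokens.
    if injected_map is None:
        attn = attention_map(tokens, w_q, w_k)
    else:
        attn = injected_map
    return attn @ (tokens @ w_v)

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8  # toy sizes, chosen arbitrarily
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# Stand-ins for intermediate features of the guidance and target images.
guidance_tokens = rng.normal(size=(n_tokens, dim))
target_tokens = rng.normal(size=(n_tokens, dim))

# Extract the map from the guidance pass, then inject it into the target pass.
guidance_map = attention_map(guidance_tokens, w_q, w_k)
plain = self_attention(target_tokens, w_q, w_k, w_v)
injected = self_attention(target_tokens, w_q, w_k, w_v, injected_map=guidance_map)
```

In this sketch the target pass keeps its own values (its "appearance") while the token-to-token affinities come from the guidance pass, which is one way to read how layout could be carried over without any training.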
