A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

Abstract

Image editing aims to edit a given synthetic or real image to meet the specific requirements of users. It has been widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Given that training-based methods learn to directly map the source image to the target one under user guidance, we discuss them separately and introduce the injection schemes for the source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracking related work at https://github.com/xinchengshuai/Awesome-Image-Editing.
