SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing

Abstract

Scene graphs offer a structured, hierarchical representation of images, with nodes and edges symbolizing objects and the relationships among them. They can serve as a natural interface for image editing, dramatically improving precision and flexibility. Leveraging this benefit, we introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing. This integration enables precise modifications at the object level and creative recomposition of scenes without compromising overall image integrity. Our approach involves two primary stages: 1) Utilizing an LLM-driven scene parser, we construct an image's scene graph, capturing key objects and their interrelationships, as well as parsing fine-grained attributes such as object masks and descriptions. These annotations facilitate concept learning with a fine-tuned diffusion model, representing each object with an optimized token and detailed description prompt. 2) During the image editing phase, an LLM editing controller guides the edits towards specific areas. These edits are then implemented by an attention-modulated diffusion editor, utilizing the fine-tuned model to perform object additions, deletions, replacements, and adjustments. Through extensive experiments, we demonstrate that our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
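To make the scene graph interface concrete, the following is a minimal sketch of a graph whose nodes are objects and whose edges are relationships, with graph-level counterparts of the addition, deletion, and replacement edits mentioned above. It is an illustration only, not the implementation used in this work: the class `SceneGraph`, its method names, and the example objects are all hypothetical, and the sketch covers only the symbolic graph, not the LLM parser or the diffusion editor.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Hypothetical layout: each node maps an object name to a text description.
    nodes: dict = field(default_factory=dict)
    # Each edge is a (subject, relation, object) triple over node names.
    edges: set = field(default_factory=set)

    def add_object(self, name, description=""):
        self.nodes[name] = description

    def add_relation(self, subject, relation, obj):
        if subject not in self.nodes or obj not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.add((subject, relation, obj))

    def remove_object(self, name):
        # Deleting a node also drops every edge that touches it.
        del self.nodes[name]
        self.edges = {e for e in self.edges if name not in (e[0], e[2])}

    def replace_object(self, old, new, description=""):
        # Swap a node while keeping its relationships intact.
        if old not in self.nodes:
            raise KeyError(old)
        del self.nodes[old]
        self.nodes[new] = description
        self.edges = {
            (new if s == old else s, r, new if o == old else o)
            for s, r, o in self.edges
        }


# Example edit: replace the dog with a cat; the "sitting on" relation survives.
graph = SceneGraph()
graph.add_object("dog", "a small brown dog")
graph.add_object("sofa", "a grey fabric sofa")
graph.add_relation("dog", "sitting on", "sofa")
graph.replace_object("dog", "cat", "a white cat")
```

In a pipeline of the kind described, an edit to such a graph would presumably be translated into a target region and prompt for the image editor; that translation step is outside the scope of this sketch.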
