Abstract
Scene graphs offer a structured, hierarchical representation of images, with nodes and edges symbolizing objects and the relationships among them. They can serve as a natural interface for image editing, dramatically improving precision and flexibility. Leveraging this benefit, we introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing. This integration enables precise modifications at the object level and creative recomposition of scenes without compromising overall image integrity. Our approach involves two primary stages: 1) Utilizing an LLM-driven scene parser, we construct an image's scene graph, capturing key objects and their interrelationships, as well as parsing fine-grained attributes such as object masks and descriptions. These annotations facilitate concept learning with a fine-tuned diffusion model, representing each object with an optimized token and detailed description prompt. 2) During the image editing phase, an LLM editing controller guides the edits towards specific areas. These edits are then implemented by an attention-modulated diffusion editor, utilizing the fine-tuned model to perform object additions, deletions, replacements, and adjustments. Through extensive experiments, we demonstrate that our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
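
To make the scene-graph interface concrete, the following minimal sketch shows one way a graph of objects and relationships could be stored and edited. It is an illustration only, not the framework's implementation: the class `SceneGraph`, its methods, and the example objects are all hypothetical, and it models only the symbolic graph edits (addition, deletion, replacement), not the LLM parser or the diffusion editor.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class SceneGraph:
    """Hypothetical container: nodes are objects, edges are relation triples."""

    # Object name -> free-text description (assumed attribute set).
    objects: Dict[str, str] = field(default_factory=dict)
    # Edges stored as (subject, predicate, object) triples.
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_object(self, name: str, description: str = "") -> None:
        self.objects[name] = description

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        # Both endpoints must already exist as nodes.
        if subject not in self.objects or obj not in self.objects:
            raise KeyError("both endpoints must be existing objects")
        self.relations.add((subject, predicate, obj))

    def remove_object(self, name: str) -> None:
        # Deleting a node also drops every edge that touches it.
        del self.objects[name]
        self.relations = {
            r for r in self.relations if name not in (r[0], r[2])
        }

    def replace_object(self, old: str, new: str, description: str = "") -> None:
        # Swap a node while keeping its relationships to other objects.
        if old not in self.objects:
            raise KeyError(old)
        del self.objects[old]
        self.objects[new] = description
        self.relations = {
            (new if s == old else s, p, new if o == old else o)
            for s, p, o in self.relations
        }


# Example edits on an invented scene.
graph = SceneGraph()
graph.add_object("dog", "a small brown dog")
graph.add_object("sofa", "a grey fabric sofa")
graph.relate("dog", "sitting on", "sofa")
graph.replace_object("dog", "cat", "a white cat")  # replacement keeps the edge
```

In this sketch a replacement preserves the edges of the replaced node, which mirrors the idea of editing one object while leaving the rest of the scene intact.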