GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Abstract

Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
