Abstract
Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce comprehensive geometric constraints that preserve the spatial configuration. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these alignment modules, our framework enhances the model's adaptability to diverse conditional generation tasks and greatly expands its application range. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual description, segmentation mask (bounding box), drag manipulation, and their combinations. Code is available at https://github.com/ZixuanWang0525/DADG.