Training-free Regional Prompting for Diffusion Transformers

Abstract

Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. Many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), but there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.
