Abstract
Text-to-image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images from open-domain text descriptions. Despite this success, text descriptions often struggle to convey detailed controls adequately, even when they are long and complex. Moreover, recent studies have shown that these models have difficulty understanding such complex texts and generating the corresponding images. There is therefore a growing need for control modes beyond text descriptions. In this paper, we introduce Uni-ControlNet, a novel approach that allows different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) to be used simultaneously, in a flexible and composable manner, within one model. Unlike existing methods, Uni-ControlNet only requires fine-tuning two additional adapters on top of frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet needs only a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces fine-tuning cost and model size, making the approach more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at \url{https://github.com/ShihaoZhaoZSH/Uni-ControlNet}.