ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Abstract

To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. A straightforward implementation would generate images from random noise and then calculate the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet of 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE for segmentation mask, line-art edge, and depth conditions, respectively.
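
The efficient reward strategy described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation. The abstract gives no formulas, so the DDPM-style noising and single-step denoising equations are an assumption, and every name here (`add_noise`, `one_step_denoise`, `toy_reward_model`, `reward_loss`, `predict_noise`) is hypothetical. Images are flat lists of pixel values, the reward model is a simple threshold standing in for a real discriminative network, and no gradients are computed.

```python
import math

def add_noise(x0, eps, alpha_bar_t):
    """Assumed DDPM forward step: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def one_step_denoise(x_t, eps_pred, alpha_bar_t):
    """Single-step estimate of the clean image from the predicted noise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]

def toy_reward_model(image):
    """Hypothetical stand-in for a discriminative model: threshold to a binary mask."""
    return [1.0 if p > 0.5 else 0.0 for p in image]

def consistency_loss(condition, extracted):
    """Mean squared error between the input condition and the extracted one."""
    return sum((c - e) ** 2 for c, e in zip(condition, extracted)) / len(condition)

def reward_loss(x0, condition, predict_noise, alpha_bar_t, eps):
    """Noise the training image, denoise it in one step, and score consistency.

    predict_noise(x_t, condition) is a hypothetical callable playing the role
    of the conditioned diffusion model; eps is the noise sample to add.
    """
    x_t = add_noise(x0, eps, alpha_bar_t)
    eps_pred = predict_noise(x_t, condition)
    x0_pred = one_step_denoise(x_t, eps_pred, alpha_bar_t)
    return consistency_loss(condition, toy_reward_model(x0_pred))

# Toy usage: a 4-pixel "image" whose thresholded mask equals the condition.
x0 = [0.1, 0.9, 0.2, 0.8]
condition = [0.0, 1.0, 0.0, 1.0]
eps = [2.0, -2.0, 2.0, -2.0]

perfect = reward_loss(x0, condition, lambda x_t, c: eps, 0.5, eps)
poor = reward_loss(x0, condition, lambda x_t, c: [0.0] * 4, 0.5, eps)
print(perfect, poor)
```

In this toy, a predictor that recovers the added noise exactly reproduces the input image and incurs no consistency loss, while a predictor that ignores the noise yields a denoised image whose extracted mask disagrees with the condition. In actual fine-tuning, the loss would be backpropagated through the single denoising step rather than through a full multi-step sampling chain, which is where the time and memory savings described above come from.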
