Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Abstract

Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which makes it difficult to meet the specific requirements of downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of these methods overlook the importance of the text encoder, which is typically pretrained and kept fixed during training. In this paper, we demonstrate that by fine-tuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it still suffers from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to fine-tune the text encoder based on task-specific rewards, an approach we refer to as TexForce. We first show that fine-tuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing U-Net fine-tuned models to get much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.
