Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning

Abstract

Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with the input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and is often not applicable to models with higher-order or black-box solvers. The objective of this study is to develop a disciplined approach to fine-tuning diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby making connections to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL and illustrate its potential in enhancing the design space of value networks by leveraging the structural property of diffusion models. We validate the advantages of our method through experiments on downstream tasks of fine-tuning the large-scale Text2Image model Stable Diffusion v1.5.
