Abstract
Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable rewards. However, these methods face two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive and thus restricts optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models to achieve the desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior so that the original image can be recovered from any timestep via interpolation, leveraging the fact that diffusion states are interpolations between noise and target images; this effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
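The interpolation idea behind Direct-Align can be illustrated with a minimal sketch. It assumes a linear (rectified-flow style) schedule, x_t = (1 - t) * x_0 + t * noise, which the abstract does not state explicitly; the function names `add_noise` and `recover_image` are hypothetical and chosen only for illustration. Under that assumption, a known, predefined noise sample lets the clean image be recovered in closed form from any timestep t < 1, without running a multistep denoising chain.

```python
import random


def add_noise(x0, noise, t):
    """Assumed linear interpolation: x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def recover_image(xt, noise, t):
    """Invert the assumed interpolation, given the predefined noise sample."""
    if not 0.0 <= t < 1.0:
        # At t = 1 the state is pure noise and x0 cannot be recovered.
        raise ValueError("t must lie in [0, 1)")
    return [(x - t * n) / (1.0 - t) for x, n in zip(xt, noise)]


random.seed(0)
x0 = [random.uniform(-1.0, 1.0) for _ in range(16)]   # toy "image" values
noise = [random.gauss(0.0, 1.0) for _ in range(16)]   # predefined noise prior

# Recover the toy image from early, middle, and late timesteps.
max_error = 0.0
for t in (0.1, 0.5, 0.9):
    xt = add_noise(x0, noise, t)
    x0_hat = recover_image(xt, noise, t)
    max_error = max(max_error, max(abs(a - b) for a, b in zip(x0, x0_hat)))
```

In this toy setting the recovery is exact up to floating-point error because the noise is known. In actual training the model's prediction would presumably stand in for part of this computation, so this sketch illustrates only the algebra and not the full method.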