Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Abstract

Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective that differs from the pretraining objective, the score/flow-matching loss. In this work, we present a theoretical analysis showing that DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
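
The core idea of reweighting the pretraining loss by advantage can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows one way an advantage-weighted flow-matching loss might be computed. Several details are assumptions not stated in the abstract: the linear interpolation path x_t = (1 - t) * x0 + t * noise with velocity target noise - x0, the within-group standardization of rewards into advantages, and all function names (`group_advantages`, `flow_matching_loss`, `awm_loss`), which are hypothetical.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group (assumed normalization, not from the abstract)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def flow_matching_loss(velocity_fn, x0, noise, t):
    """Per-sample flow-matching loss on the assumed linear path
    x_t = (1 - t) * x0 + t * noise, with velocity target noise - x0."""
    x_t = [(1 - t) * a + t * n for a, n in zip(x0, noise)]
    target = [n - a for a, n in zip(x0, noise)]
    pred = velocity_fn(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(x0)


def awm_loss(velocity_fn, samples, rewards, rng):
    """Average of the per-sample flow-matching losses, each weighted by its advantage.

    Minimizing this lowers the matching loss on positive-advantage samples
    and raises it on negative-advantage samples.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for x0, adv in zip(samples, advantages):
        t = rng.random()                             # timestep in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in x0]    # fresh Gaussian noise
        total += adv * flow_matching_loss(velocity_fn, x0, noise, t)
    return total / len(samples)


def zero_velocity(x_t, t):
    """Toy stand-in for a learned velocity network: always predicts zero."""
    return [0.0 for _ in x_t]


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]   # toy generated samples
    rewards = [0.2, 0.9, 0.5]                          # toy reward scores
    print(awm_loss(zero_velocity, samples, rewards, rng))
```

In a real setup, `velocity_fn` would be the diffusion model being fine-tuned and the loss would be differentiated with respect to its parameters; the toy `zero_velocity` function is only there so the sketch runs on its own.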
