Abstract
Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.