TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

Abstract

Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance on human preference alignment and standard text-to-image benchmarks.
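
The two ideas above can be illustrated with a small sketch. It is not the paper's implementation: the abstract gives no formulas, so the weighting rule (weights proportional to a per-timestep noise level, normalized to mean 1), the function names, and all numbers are assumptions chosen for illustration. Only the group-relative normalization of rewards is the standard GRPO ingredient. Each inner list holds the terminal rewards of samples that share a trajectory up to one branching timestep, so the reward differences within a group are attributed to that timestep.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def noise_aware_weights(noise_levels):
    """Hypothetical rule: weight proportional to noise level, normalized to mean 1."""
    total = sum(noise_levels)
    n = len(noise_levels)
    return [n * s / total for s in noise_levels]

def weighted_advantages(rewards_by_branch_step, noise_levels):
    """Scale each branching step's group advantages by that step's weight."""
    weights = noise_aware_weights(noise_levels)
    return [
        [w * a for a in group_relative_advantages(group)]
        for w, group in zip(weights, rewards_by_branch_step)
    ]

# Illustrative values only: two branching timesteps, four samples each.
# The first (early) step is assumed to have a higher noise level than the second.
rewards_by_branch_step = [[0.2, 0.9, 0.5, 0.4], [0.6, 0.7, 0.5, 0.6]]
noise_levels = [1.0, 0.3]
scaled = weighted_advantages(rewards_by_branch_step, noise_levels)
```

Under these assumptions, the early high-noise step contributes larger advantage magnitudes to the policy update than the late low-noise step, which mirrors the abstract's stated preference for learning during early stages and gentler refinement later.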
