GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Abstract

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
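To make the clipping failure described in the abstract concrete, the following is a minimal numerical sketch, not the paper's implementation. It assumes log importance ratios at two hypothetical timesteps are Gaussian with a negative mean and different spreads (the values -0.15, 0.10, and 0.30 are invented for illustration), and it uses one plausible form of ratio normalization: centering the log-ratios and rescaling them to a shared standard deviation. The helper names (`clipped_fraction`, `normalize_log_ratios`, `simulate`), the clip range `eps=0.2`, and the target spread `0.2` are all assumptions; the abstract does not specify the exact normalization GRPO-Guard uses.

```python
import math
import random
import statistics


def clipped_fraction(log_ratios, eps=0.2):
    """Fraction of samples with ratio > 1 + eps.

    For a positive-advantage sample, the PPO clipped objective stops
    contributing gradient only once the ratio exceeds 1 + eps, so this is
    the share of positive updates the clip actually constrains.
    """
    upper = 1.0 + eps
    return sum(math.exp(lr) > upper for lr in log_ratios) / len(log_ratios)


def normalize_log_ratios(log_ratios, target_std):
    """Hypothetical normalization: zero-mean log-ratios with a shared spread."""
    mean = statistics.fmean(log_ratios)
    std = statistics.pstdev(log_ratios)
    scale = target_std / std if std > 0 else 1.0
    return [(lr - mean) * scale for lr in log_ratios]


def simulate(mean, std, n=20000, seed=0):
    """Draw synthetic log-ratios; a stand-in for real per-timestep statistics."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


# Two hypothetical timesteps: same left shift, different variance.
step_a = simulate(mean=-0.15, std=0.10, seed=1)
step_b = simulate(mean=-0.15, std=0.30, seed=2)

before_a = clipped_fraction(step_a)
before_b = clipped_fraction(step_b)

norm_a = normalize_log_ratios(step_a, target_std=0.2)
norm_b = normalize_log_ratios(step_b, target_std=0.2)

after_a = clipped_fraction(norm_a)
after_b = clipped_fraction(norm_b)

print(f"upper-clip rate before: a={before_a:.4f}, b={before_b:.4f}")
print(f"upper-clip rate after:  a={after_a:.4f}, b={after_b:.4f}")
```

Under these assumed distributions, the low-variance timestep almost never reaches the upper clip bound before normalization, so positive-advantage updates there go unconstrained, while the two timesteps are clipped at very different rates. After centering and rescaling, both timesteps hit the bound at a similar rate, which is the "balanced and step-consistent" behavior the abstract attributes to ratio normalization.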
