DanceGRPO: Unleashing GRPO on Visual Generation

Abstract

Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern ordinary differential equation (ODE)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only stabilizes policy optimization for complex video generation, but also enables the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
