Flow Matching Policy Gradients

  • 2025-07-28 17:59:57
  • David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa

Abstract

Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
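
To make the "advantage-weighted ratio computed from the conditional flow matching loss" more concrete, here is a minimal NumPy sketch of what a PPO-clip style surrogate of this kind might look like. The abstract does not give the formula, so several things here are assumptions: the ratio is taken to be exp(old loss minus new loss), the interpolation path is the linear one x_t = (1 - t) a + t eps with target velocity eps - a, and the names `cfm_loss`, `fpo_surrogate`, `velocity_fn`, and `clip_eps` are hypothetical. It illustrates the idea and should not be read as the authors' implementation.

```python
import numpy as np


def cfm_loss(velocity_fn, action, obs, t, eps):
    """Monte Carlo estimate of a conditional flow matching loss per action.

    ASSUMPTION: linear path x_t = (1 - t) * action + t * eps with target
    velocity (eps - action). Other conventions are equally plausible.

    Shapes: action (B, D), t (B, N, 1), eps (B, N, D); returns (B,).
    velocity_fn is a hypothetical callable (x_t, t, obs) -> (B, N, D).
    """
    a = action[:, None, :]                 # (B, 1, D), broadcast over N draws
    x_t = (1.0 - t) * a + t * eps          # noisy interpolant
    target = eps - a                       # velocity along the linear path
    pred = velocity_fn(x_t, t, obs)
    sq_err = np.sum((pred - target) ** 2, axis=-1)   # (B, N)
    return np.mean(sq_err, axis=-1)                  # average over N draws


def fpo_surrogate(loss_new, loss_old, advantage, clip_eps=0.2):
    """PPO-clip objective with a ratio built from CFM losses.

    ASSUMPTION: ratio = exp(loss_old - loss_new), so lowering the loss on an
    action raises its ratio, playing the role of a likelihood ratio.
    Returns a scalar to be maximized.
    """
    ratio = np.exp(loss_old - loss_new)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))
```

In this sketch the same `t` and `eps` draws would be reused when evaluating the old and new parameters, so that the difference of losses reflects only the parameter change. No likelihood is ever computed, and nothing depends on which sampler is used to generate actions.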

 
