Abstract
We present \textbf{FlowRL}, a novel framework for online reinforcement learning that integrates flow-based policy representation with Wasserstein-2-regularized optimization. We argue that, in addition to training signals, enhancing the expressiveness of the policy class is crucial for performance gains in RL. Flow-based generative models offer such potential, excelling at capturing complex, multimodal action distributions. However, their direct application in online RL is challenging due to a fundamental objective mismatch: standard flow training optimizes for imitation of static data, while RL requires value-based policy optimization over a dynamic replay buffer, leading to difficult optimization landscapes. FlowRL first models policies via a state-dependent velocity field, generating actions through deterministic ODE integration from noise. We then derive a constrained policy search objective that jointly maximizes Q through the flow policy while bounding the Wasserstein-2 distance to a behavior-optimal policy implicitly derived from the replay buffer. This formulation effectively aligns the flow optimization with the RL objective, enabling efficient and value-aware policy learning despite the complexity of the policy class. Empirical evaluations on DMControl and HumanoidBench demonstrate that FlowRL achieves competitive performance on online reinforcement learning benchmarks.
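
As a schematic reading of the formulation described above, the two ingredients can be written in notation assumed here purely for illustration: the velocity field $v_\theta$, critic $Q$, replay buffer $\mathcal{D}$, behavior-optimal policy $\pi_{\beta^*}$, and constraint radius $\epsilon$ are hypothetical labels and need not match those used in the main text. An action is generated by integrating the state-dependent velocity field from a noise sample (taken here to be Gaussian, which is an assumption):
\[
x_0 \sim \mathcal{N}(0, I), \qquad \frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t \mid s), \quad t \in [0, 1], \qquad a = x_1 .
\]
Writing $\pi_\theta(\cdot \mid s)$ for the distribution of $a$ induced by this ODE, the constrained policy search may be sketched, for example with the constraint imposed per state, as
\[
\max_{\theta} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid s)} \big[ Q(s, a) \big]
\quad \mathrm{s.t.} \quad
W_2\big( \pi_\theta(\cdot \mid s),\, \pi_{\beta^*}(\cdot \mid s) \big) \le \epsilon ,
\]
where $W_2$ denotes the Wasserstein-2 distance. Whether the constraint is enforced per state or in expectation over $\mathcal{D}$, and how it is relaxed in practice, is not specified at this level of description.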