Flow-Based Policy for Online Reinforcement Learning

Abstract

We present FlowRL, a novel framework for online reinforcement learning that integrates flow-based policy representation with Wasserstein-2-regularized optimization. We argue that in addition to training signals, enhancing the expressiveness of the policy class is crucial for the performance gains in RL. Flow-based generative models offer such potential, excelling at capturing complex, multimodal action distributions. However, their direct application in online RL is challenging due to a fundamental objective mismatch: standard flow training optimizes for static data imitation, while RL requires value-based policy optimization through a dynamic buffer, leading to difficult optimization landscapes. FlowRL first models policies via a state-dependent velocity field, generating actions through deterministic ODE integration from noise. We derive a constrained policy search objective that jointly maximizes Q through the flow policy while bounding the Wasserstein-2 distance to a behavior-optimal policy implicitly derived from the replay buffer. This formulation effectively aligns the flow optimization with the RL objective, enabling efficient and value-aware policy learning despite the complexity of the policy class. Empirical evaluations on DMControl and HumanoidBench demonstrate that FlowRL achieves competitive performance in online reinforcement learning benchmarks.
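
To make the action-generation step concrete, the following is a minimal sketch of sampling an action by integrating a state-dependent velocity field from Gaussian noise. It is an illustration only, not the authors' implementation: the names velocity_field, init_params, and sample_action, the small tanh MLP, the fixed-step Euler integrator, and the default of 10 steps are all assumptions introduced here.

```python
import numpy as np


def init_params(state_dim, action_dim, hidden=32, seed=0):
    """Randomly initialise a tiny MLP (hypothetical stand-in for a learned network)."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + action_dim + 1  # state, current action, and time t
    w1 = rng.standard_normal((hidden, in_dim)) * 0.1
    b1 = np.zeros(hidden)
    w2 = rng.standard_normal((action_dim, hidden)) * 0.1
    b2 = np.zeros(action_dim)
    return w1, b1, w2, b2


def velocity_field(state, action, t, params):
    """Assumed form of the velocity field v(s, a, t): one hidden tanh layer."""
    w1, b1, w2, b2 = params
    x = np.concatenate([state, action, [t]])
    return w2 @ np.tanh(w1 @ x + b1) + b2


def sample_action(state, params, action_dim, n_steps=10, rng=None):
    """Draw a_0 ~ N(0, I), then integrate da/dt = v(s, a, t) from t=0 to t=1.

    Uses explicit Euler steps; the integrator and step count are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal(action_dim)  # the only source of randomness
    dt = 1.0 / n_steps
    for k in range(n_steps):
        a = a + dt * velocity_field(state, a, k * dt, params)
    return a


if __name__ == "__main__":
    params = init_params(state_dim=3, action_dim=2)
    state = np.ones(3)
    print(sample_action(state, params, action_dim=2, rng=np.random.default_rng(0)))
```

Because the integration itself is deterministic, all stochasticity comes from the initial noise draw: the same state and the same noise sample always map to the same action. In a value-based training loop, gradients of Q with respect to the action would be propagated back through these integration steps, which in practice calls for an automatic-differentiation framework rather than plain NumPy.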
