Training Diffusion Policies via Prior-Mapping Co-Evolution

Abstract

Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize (e.g., Gaussians) are often too simple to represent the multimodal action distributions required for complex control. Conversely, expressive generative policies, such as diffusion and flow matching, can be difficult to optimize in online RL due to intractable likelihoods and gradients propagating through long sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithm-agnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Viewed as prior-mapping co-evolution, each stage first improves a tractable latent prior through RL and then consolidates the resulting behavior into a more expressive prior-to-action mapping. This two-timescale schedule, anchored by fixed-prior decoder refinement, enables stable optimization while continuously expanding expressiveness. Empirically, GoRL consistently outperforms unimodal and generative baselines across diverse continuous-control tasks. Notably, GoRL achieves returns exceeding 870 on HopperStand, more than 3 times the strongest baseline; on high-dimensional humanoid tasks, it further outperforms the strongest non-GoRL baseline by over an order of magnitude.
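
The alternating schedule described in the abstract can be illustrated with a deliberately small toy. The sketch below is not the GoRL implementation: it assumes a stateless one-dimensional task with a hypothetical reward peaked at a = 2, a linear decoder standing in for a diffusion or flow-matching decoder, a plain score-function (REINFORCE-style) update for the latent policy, and a least-squares refit as the consolidation step. All names (`reward`, `run_co_evolution`) and hyperparameters are illustrative assumptions.

```python
import numpy as np


def reward(a):
    # Hypothetical toy objective (an assumption): the best action is a = 2.
    return -(a - 2.0) ** 2


def run_co_evolution(stages=5, rl_steps=20, batch=512, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    w, b = 1.0, 0.0  # linear decoder a = w * z + b (stand-in for a generative decoder)
    mu = 0.0         # mean of the tractable latent policy z ~ N(mu, 1)

    for _ in range(stages):
        # Fast timescale: improve the latent policy with the decoder frozen,
        # using a score-function gradient estimate with a mean baseline.
        for _ in range(rl_steps):
            z = mu + rng.standard_normal(batch)
            r = reward(w * z + b)
            mu += lr * np.mean((r - r.mean()) * (z - mu))

        # Slow timescale: consolidate the improved behavior into the decoder.
        # Noise from the fixed prior N(0, 1) is paired with the actions the
        # current latent policy produces, and the decoder is refit on the pairs.
        eps = rng.standard_normal(batch)
        actions = w * (mu + eps) + b
        w, b = np.polyfit(eps, actions, 1)  # returns [slope, intercept]

        # Reset the latent policy to the fixed prior for the next stage.
        mu = 0.0

    return w, b, mu


if __name__ == "__main__":
    w, b, mu = run_co_evolution()
    print(f"decoder: a = {w:.3f} * z + {b:.3f}, latent mean = {mu}")
```

In this toy the decoder can only absorb a shift, so it shows the division of labor between the two timescales rather than any growth in expressiveness; a multimodal decoder would be needed to illustrate that part of the claim.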
