Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing reasoning capabilities in large language models. However, it is constrained by a fundamental asymmetry in computation and memory requirements: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling). PODS produces numerous rollouts in parallel, then trains on only an informative subset, preserving learning signals while slashing update cost. We instantiate PODS with max-variance down-sampling, a principled criterion that maximises reward diversity, and show that it admits an $O(n\log n)$ solution. Empirically, coupling PODS with Group Relative Policy Optimization (GRPO) achieves superior performance over standard GRPO across different reasoning benchmarks and hardware environments.
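
To make the max-variance criterion concrete, the following is a minimal sketch of how a size-$m$ subset with maximal reward variance could be selected in $O(n\log n)$ time. It rests on the assumption that an optimal subset always consists of the $m-k$ lowest and $k$ highest rewards for some $k$, so that it suffices to sort once and compare the $m+1$ possible splits using prefix sums. The function name `max_variance_subset`, the use of population variance, and the tie-breaking rule are illustrative choices, not details taken from the abstract.

```python
def max_variance_subset(rewards, m):
    """Return indices of m rollouts whose rewards have maximal variance.

    Sketch only: assumes an optimal subset is made of the (m - k) lowest
    and k highest rewards for some k, and uses population variance.
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(rewards)")

    # Sort indices by reward: the O(n log n) step.
    order = sorted(range(n), key=lambda i: rewards[i])
    sorted_rewards = [rewards[i] for i in order]

    # Prefix sums of rewards and squared rewards over the sorted list.
    prefix_sum = [0.0]
    prefix_sq = [0.0]
    for r in sorted_rewards:
        prefix_sum.append(prefix_sum[-1] + r)
        prefix_sq.append(prefix_sq[-1] + r * r)

    best_var, best_k = -1.0, 0
    for k in range(m + 1):  # k = number of rollouts taken from the high end
        low = m - k         # number of rollouts taken from the low end
        total = prefix_sum[low] + (prefix_sum[n] - prefix_sum[n - k])
        total_sq = prefix_sq[low] + (prefix_sq[n] - prefix_sq[n - k])
        var = total_sq / m - (total / m) ** 2
        if var > best_var:  # ties keep the smallest k (arbitrary choice)
            best_var, best_k = var, k

    low = m - best_k
    return order[:low] + order[n - best_k:]


# Example: six rollouts with rewards in {0, 0.5, 1}; keep two of them.
rewards = [0.0, 1.0, 0.0, 1.0, 0.5, 0.5]
selected = max_variance_subset(rewards, 2)
```

In this example the selection pairs one zero-reward rollout with one full-reward rollout, which is the most diverse pair available. After sorting, each candidate split is evaluated in constant time, so the sort dominates the total cost.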