Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing reasoning capabilities in large language models, but faces a fundamental asymmetry in computation and memory requirements: inference is embarrassingly parallel with a minimal memory footprint, while policy updates require extensive synchronization and are memory-intensive. To address this asymmetry, we introduce PODS (Policy Optimization with Down-Sampling), a framework that strategically decouples these phases by generating numerous rollouts in parallel but updating only on an informative subset. Within this framework, we develop max-variance down-sampling, a theoretically motivated method that selects rollouts with maximally diverse reward signals. We prove that this approach has an efficient algorithmic solution, and empirically demonstrate that GRPO with PODS using max-variance down-sampling achieves superior performance over standard GRPO on the GSM8K benchmark.
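As a rough illustration of the down-sampling step, the sketch below picks m of n rollouts so that the variance of their rewards is as large as possible. It is not the paper's implementation: the function name `max_variance_subset` and its parameters are hypothetical, and the search assumes that a maximum-variance subset can always be formed from the k highest and m - k lowest rewards for some k, so only m + 1 candidates need to be checked after sorting.

```python
from statistics import pvariance


def max_variance_subset(rewards, m):
    """Return indices of m rollouts whose rewards have maximal population variance.

    Assumption (not taken from the abstract): an optimal subset consists of the
    k highest and (m - k) lowest rewards for some k in 0..m.
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(rewards)")

    # Indices of rollouts sorted by ascending reward.
    order = sorted(range(n), key=lambda i: rewards[i])

    best_var, best_idx = -1.0, []
    for k in range(m + 1):
        # Candidate: the (m - k) lowest-reward and k highest-reward rollouts.
        idx = order[: m - k] + order[n - k:]
        var = pvariance([rewards[i] for i in idx])
        if var > best_var:
            best_var, best_idx = var, idx
    return best_idx


# Hypothetical usage: keep 4 of 6 rollouts for the policy update.
rollout_rewards = [0.0, 1.0, 1.0, 0.0, 0.5, 1.0]
kept = max_variance_subset(rollout_rewards, 4)
```

This version recomputes the variance for each candidate, which costs O(n log n + m^2); running sums over the sorted rewards would remove the quadratic term if that matters in practice.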