Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
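To make the probability concentration effect concrete, the following minimal sketch computes three per-token quantities of the kind the analysis tracks: the top-1 probability, the probability mass of the top-K candidates, and the entropy of the distribution. It is an illustration only, not the paper's measurement code. The function names, the choice of K, and the example logits are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def concentration_stats(logits, k=3):
    """Return (top-1 probability, top-k mass, entropy in nats) for one token position.

    Assumption: `logits` holds the model's scores over vocabulary candidates
    at a single decoding step; `k` is an illustrative choice.
    """
    probs = sorted(softmax(logits), reverse=True)
    top1 = probs[0]
    topk_mass = sum(probs[:k])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return top1, topk_mass, entropy

# Hypothetical logits for the same token position before and after training.
flat_logits = [1.0, 0.9, 0.8, 0.1]    # mass spread across candidates
peaked_logits = [5.0, 0.9, 0.8, 0.1]  # mass concentrated on the top-1 candidate

flat_stats = concentration_stats(flat_logits)
peaked_stats = concentration_stats(peaked_logits)
```

Under these assumed inputs, the peaked distribution has a higher top-1 probability and a lower entropy than the flat one, which is the over-concentration pattern described above.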