Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
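
For readers unfamiliar with the metric, the sketch below shows one common way Pass@k is estimated from n sampled solutions per problem, c of which are verified correct. It uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k); the abstract does not state which estimator is used here, so this choice and the function name `pass_at_k` are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem (illustrative helper, not from the paper).

    n: number of sampled solutions
    c: number of those samples verified as correct
    k: sampling budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no correct solution, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 64 samples, 4 of them correct.
    print(pass_at_k(64, 4, 1))
    print(pass_at_k(64, 4, 32))
```

With a fixed number of correct samples, the estimate grows with k, which is why Pass@k at large k is sensitive to generation diversity while Pass@1 reflects only the average per-sample accuracy.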