Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean for each prompt. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. In practice, the mean reward for each prompt in a batch is estimated by the empirical average over that prompt's own generations. Drawing inspiration from Stein's paradox, we propose using shrinkage estimators that combine per-prompt and across-prompt means to improve the overall per-prompt mean estimation accuracy, particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.
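To make the idea of combining per-prompt and across-prompt means concrete, the following minimal sketch may help. It is an illustration only, not the estimator constructed in the paper: it assumes a positive-part James-Stein-style rule that shrinks each per-prompt mean toward the across-prompt mean, with the weight computed from the batch itself. The function names `shrinkage_baselines` and `centered_rewards`, the pooled-variance noise estimate, and the requirement of an equal number of generations per prompt are all assumptions made for this example.

```python
from statistics import fmean, variance


def shrinkage_baselines(rewards):
    """Illustrative James-Stein-style baseline (an assumption, not the paper's rule).

    rewards[i] holds the scalar rewards of the generations sampled for prompt i.
    Returns (baselines, lam), where lam is the weight on the across-prompt mean.
    """
    k = len(rewards)
    n = len(rewards[0]) if k else 0
    if k < 4 or n < 2 or any(len(r) != n for r in rewards):
        raise ValueError("need >= 4 prompts with the same number (>= 2) of generations")

    prompt_means = [fmean(r) for r in rewards]   # standard per-prompt baselines
    grand_mean = fmean(prompt_means)             # across-prompt mean

    # Estimated noise variance of one per-prompt mean: pooled within-prompt
    # sample variance divided by the number of generations.
    noise_var = fmean([variance(r) for r in rewards]) / n

    # How far the per-prompt means spread around the across-prompt mean.
    spread = sum((m - grand_mean) ** 2 for m in prompt_means)

    # Data-driven weight, clipped to [0, 1] (the "positive-part" rule).
    lam = 1.0 if spread == 0 else min(1.0, (k - 3) * noise_var / spread)

    baselines = [lam * grand_mean + (1.0 - lam) * m for m in prompt_means]
    return baselines, lam


def centered_rewards(rewards):
    """Subtract the shrunken baseline from every reward of the matching prompt."""
    baselines, _ = shrinkage_baselines(rewards)
    return [[x - b for x in r] for r, b in zip(rewards, baselines)]
```

In this sketch, few generations per prompt make `noise_var` large relative to `spread`, so `lam` moves toward 1 and the baseline leans on the across-prompt mean. With many generations, `lam` moves toward 0 and the usual per-prompt mean is recovered.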