Spurious Rewards: Rethinking Training Signals in RLVR

Abstract

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting), nearly matching the 29.1% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning (thinking in code without actual code execution) to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research should possibly be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.
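
The abstract names several reward variants without defining them. As a rough illustration only, the Python sketch below shows one way such reward functions might look. Every function name, the `\boxed{...}` format check, the exact-string answer comparison, and the 0.5 reward probability are assumptions made for this sketch, not details taken from the paper.

```python
import random
import re
from collections import Counter


def ground_truth_reward(answer: str, label: str) -> float:
    """Standard verifiable reward: 1 if the answer matches the true label."""
    return 1.0 if answer == label else 0.0


def random_reward(rng: random.Random, p: float = 0.5) -> float:
    """Spurious reward that ignores the response entirely.

    Assumption: a Bernoulli draw with probability p (0.5 chosen arbitrarily).
    """
    return 1.0 if rng.random() < p else 0.0


def format_reward(response: str) -> float:
    """Spurious reward that checks only formatting, not correctness.

    Assumption: any response containing a non-empty \\boxed{...} is rewarded.
    """
    return 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0


def incorrect_label_reward(answer: str, wrong_label: str) -> float:
    """Spurious reward that matches against a deliberately wrong label."""
    return 1.0 if answer == wrong_label else 0.0


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Label-free reward: each sampled answer is scored against the most
    common answer among the samples for the same prompt."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]
```

In this sketch, only `ground_truth_reward` depends on the correct answer. The random and format variants carry no information about correctness, and the incorrect-label variant is anti-correlated with it, which is the property the abstract describes as "little, no, or even negative correlation."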
