The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Abstract

Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (a scheme we call reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
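
The two reward schemes named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The function names `noisy_reward` and `reasoning_pattern_reward`, the binary 0/1 reward, the per-phrase scoring, and the phrase list are all assumptions made for illustration; the only phrase taken from the abstract is "first, I need to".

```python
import random

def noisy_reward(is_correct: bool, flip_prob: float, rng: random.Random) -> float:
    """Binary correctness reward whose output is flipped with probability flip_prob.

    Assumption: the clean reward is 1.0 for a correct answer and 0.0 otherwise.
    """
    reward = 1.0 if is_correct else 0.0
    if rng.random() < flip_prob:
        reward = 1.0 - reward  # simulate a wrong reward signal
    return reward

# Hypothetical phrase list; only the first entry is quoted in the abstract.
KEY_PHRASES = ("first, i need to",)

def reasoning_pattern_reward(response: str, phrases=KEY_PHRASES, per_phrase: float = 1.0) -> float:
    """Reward the appearance of key reasoning phrases, ignoring answer correctness.

    Assumption: each distinct phrase found (case-insensitively) adds per_phrase.
    """
    text = response.lower()
    return per_phrase * sum(1 for p in phrases if p in text)

if __name__ == "__main__":
    rng = random.Random(0)
    # With flip_prob=0.4, roughly 40% of rewards for correct answers become 0.0.
    flipped = sum(noisy_reward(True, 0.4, rng) == 0.0 for _ in range(10_000))
    print(f"flipped fraction: {flipped / 10_000:.3f}")
    print(reasoning_pattern_reward("First, I need to isolate x."))
```

Note that `reasoning_pattern_reward` never inspects the final answer, which is the property the abstract highlights: the signal depends only on the form of the reasoning, not on its outcome.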
