Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.
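
The pair-construction idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how negative samples are obtained or which rule is used, so the names `Rollout`, `rule_accepts`, `build_pairs`, and the caller-supplied `known_negatives` pool are all hypothetical, and the last-token rule is only a toy stand-in. The one property taken from the abstract is that a rule-based check is trusted when it accepts a response, so only rule-accepted rollouts become positives.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rollout:
    question: str
    reference: str  # reference answer; a reference-based reward model would also receive it
    response: str


def rule_accepts(rollout: Rollout) -> bool:
    # Toy rule (assumption): the last whitespace-separated token must equal the reference.
    tokens = rollout.response.split()
    return bool(tokens) and tokens[-1] == rollout.reference.strip()


def build_pairs(
    rollouts: List[Rollout],
    known_negatives: Dict[str, List[str]],
) -> List[Tuple[Rollout, Rollout]]:
    """Pair each rule-accepted rollout with known-incorrect responses to the same question.

    `known_negatives` maps a question to incorrect responses; where those come
    from is outside this sketch (assumption).
    """
    pairs: List[Tuple[Rollout, Rollout]] = []
    for rollout in rollouts:
        if not rule_accepts(rollout):
            # A rule rejection is not treated as proof of incorrectness,
            # so rejected rollouts are skipped rather than used as negatives.
            continue
        for bad_response in known_negatives.get(rollout.question, []):
            negative = Rollout(rollout.question, rollout.reference, bad_response)
            pairs.append((rollout, negative))
    return pairs
```

Each resulting (positive, negative) pair could then feed a preference-style update of the reward model; the specific training objective is not given in the abstract and is left out here.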
