Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement

Abstract

Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), the scaling-up dilemma remains because of its reliance on human-annotated labels, especially for complex tasks. Recent alternatives that explore various self-reward signals show potential for eliciting LLM reasoning, but suffer from a non-negligible collapse issue. Inspired by the success of self-supervised learning, we propose \textit{Co-Reward}, a novel RL framework that leverages contrastive agreement across semantically analogous questions as the basis for the reward. Specifically, we construct a similar question for each training sample (without labels) and synthesize a surrogate label for each of the two questions through simple rollout voting; the reward is then constructed by cross-referring the labels of each question pair, which enforces internal reasoning consistency across analogous inputs. Intuitively, such a self-supervised reward-shaping mechanism makes it harder for learning to collapse into a trivial solution, and promotes stable reasoning elicitation and improvement by expanding the input sample variants. Empirically, Co-Reward achieves superior performance compared to other self-reward baselines on multiple reasoning benchmarks and LLM series, and reaches or even surpasses the ground-truth (GT) labeled reward, with improvements of up to $+6.8\%$ on MATH500 over GT reward on Llama-3.2-3B-Instruct. Our code is publicly available at https://github.com/tmlr-group/Co-Reward.
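
The cross-referred reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `majority_vote` and `co_reward`, the exact string matching of answers, and the binary 0/1 reward are all assumptions made for illustration. Each question's surrogate label is the majority answer among its own rollouts, and each rollout is then scored against the label voted for the paired question.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Surrogate label: the most frequent final answer among the rollouts.

    Ties are broken in favor of the answer that appears first.
    """
    return Counter(answers).most_common(1)[0][0]


def co_reward(
    orig_answers: List[str], para_answers: List[str]
) -> Tuple[List[float], List[float]]:
    """Hypothetical cross-referred reward for one question pair.

    orig_answers: final answers from rollouts on the original question.
    para_answers: final answers from rollouts on the analogous question.
    """
    label_orig = majority_vote(orig_answers)
    label_para = majority_vote(para_answers)
    # Cross-reference: each rollout is checked against the label voted
    # for the *other* question in the pair (binary reward is an assumption).
    rewards_orig = [1.0 if a == label_para else 0.0 for a in orig_answers]
    rewards_para = [1.0 if a == label_orig else 0.0 for a in para_answers]
    return rewards_orig, rewards_para
```

Under these assumptions, a rollout earns reward only when its answer matches what the model most often says on the rephrased question, so a policy cannot score well by being self-consistent on a single phrasing alone.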
