Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning

Abstract

Recent advances in Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits its broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.
