Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs), mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.
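
The dual-token idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the quantile-based entropy split, the PPO-style clipped surrogate, the choice of KL estimator, and every numeric default below are assumptions made for illustration only. The sketch shows the one property the abstract states, namely that all tokens contribute to a single loss (a synchronous update with no gradient masking), while reasoning tokens get a wider clip range and a weaker KL penalty than knowledge tokens.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_entropy(entropies, quantile=0.8):
    """Flag tokens at or above an entropy quantile as reasoning tokens.

    The quantile rule and its default value are assumptions, not taken
    from the paper.
    """
    ranked = sorted(entropies)
    idx = min(int(quantile * len(ranked)), len(ranked) - 1)
    threshold = ranked[idx]
    return [h >= threshold for h in entropies]


def dual_constraint_loss(logp_new, logp_old, logp_ref, advantages, is_reasoning,
                         clip_reasoning=0.5, clip_knowledge=0.2,
                         kl_reasoning=0.0, kl_knowledge=0.001):
    """Mean per-token loss with token-type-dependent clipping and KL weight.

    All four hyperparameter defaults are hypothetical placeholders.
    """
    total = 0.0
    for lp_n, lp_o, lp_r, adv, reasoning in zip(
            logp_new, logp_old, logp_ref, advantages, is_reasoning):
        # Looser constraints for reasoning tokens, tighter for knowledge tokens.
        eps = clip_reasoning if reasoning else clip_knowledge
        beta = kl_reasoning if reasoning else kl_knowledge

        # PPO-style clipped surrogate on the importance ratio.
        ratio = math.exp(lp_n - lp_o)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)

        # Non-negative per-token estimate of KL(new || reference).
        log_r = lp_r - lp_n
        kl = math.exp(log_r) - log_r - 1.0

        # Every token stays in the same loss: no masking, one update.
        total += -surrogate + beta * kl
    return total / len(logp_new)
```

In a real training loop the log-probabilities would come from the current policy, the rollout policy, and a frozen reference model, and the loss would be computed on tensors with automatic differentiation rather than on Python floats.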
