Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation

  • 2025-10-21 15:32:26
  • Ming Li

Abstract

Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
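
The abstract names two mechanisms, a mixed accuracy/abstention reward and a competence-gated curriculum over instruction shards, without giving their exact form. The following is a minimal sketch of how such pieces might fit together, not the authors' implementation. The `Rollout` fields, the binary reward values, the 0.7 gating threshold, and the one-shard-at-a-time advance are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One finished multi-turn dialogue (hypothetical structure)."""
    solvable: bool   # was the question solvable from what had been revealed?
    abstained: bool  # did the model decline to give a final answer?
    correct: bool    # did a verifier accept the final answer?


def mixed_reward(r: Rollout) -> float:
    """Illustrative reward: 1.0 for a verified answer on a solvable
    question or for abstaining on an unsolvable one, else 0.0."""
    if r.solvable:
        return 1.0 if (not r.abstained and r.correct) else 0.0
    return 1.0 if r.abstained else 0.0


class CompetenceGatedCurriculum:
    """Raises the number of instruction shards only once the mean
    reward at the current level reaches a threshold (assumed rule)."""

    def __init__(self, max_shards: int, threshold: float = 0.7, start: int = 1):
        self.shards = start
        self.max_shards = max_shards
        self.threshold = threshold

    def update(self, rewards: list[float]) -> int:
        # Advance one level when competence at the current level is shown.
        if rewards and sum(rewards) / len(rewards) >= self.threshold:
            self.shards = min(self.shards + 1, self.max_shards)
        return self.shards
```

Under these assumptions, a premature answer on a not-yet-solvable question earns nothing even if it happens to be correct, which is the behavior the abstract associates with LiC, and dialogues only get longer once the policy is reliable at the current shard count.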

 
