Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
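
The reward described above can be illustrated with a small sketch. The abstract does not state the exact formula, so the form used here (binary correctness minus the squared error between the stated confidence and the correctness indicator) is an assumption, and the names `rlcr_reward` and `expected_reward` are hypothetical. The sketch also checks numerically that, under this assumed form, expected reward is highest when the stated confidence equals the true probability of being correct, which is the property a proper scoring rule is meant to provide.

```python
def rlcr_reward(correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    # Brier penalty: squared gap between stated confidence and the outcome.
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * rlcr_reward(True, confidence)
            + (1.0 - p_correct) * rlcr_reward(False, confidence))


if __name__ == "__main__":
    p = 0.7  # hypothetical true probability that the answer is correct
    grid = [i / 100 for i in range(101)]
    # Search for the stated confidence that maximizes expected reward.
    best = max(grid, key=lambda q: expected_reward(p, q))
    print(f"best stated confidence for p={p}: {best}")
```

Under this assumed form, a confidently wrong answer (confidence 1.0, incorrect) receives the lowest reward, while a wrong answer stated with zero confidence is not penalized beyond the missing correctness bonus.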
