Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

Abstract

Large Language Models (LLMs) often produce plausible but poorly calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model's probability estimates, restoring well-behaved calibration, and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model's own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrants further research in intrinsic rewards for LLM post-training.
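
The ranking step described above can be illustrated with a minimal sketch. The abstract does not state how answer-span confidence is defined, so the code below assumes, purely for illustration, that it is the geometric mean of the answer-token probabilities. The names `Trace`, `answer_confidence`, `rank_traces`, `preference_pairs`, and the `min_gap` threshold are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Trace:
    text: str                     # full chain-of-thought solution
    answer_logprobs: List[float]  # log-probabilities of the final-answer tokens


def answer_confidence(trace: Trace) -> float:
    """Assumed confidence: geometric mean of the answer-token probabilities."""
    if not trace.answer_logprobs:
        raise ValueError("trace has no answer tokens")
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def rank_traces(traces: List[Trace]) -> List[Trace]:
    """Sort traces from most to least confident final answer."""
    return sorted(traces, key=answer_confidence, reverse=True)


def preference_pairs(traces: List[Trace], min_gap: float = 0.0) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) text pairs from the confidence ranking.

    A pair is kept only if the confidence difference exceeds `min_gap`
    (a hypothetical filter, not described in the abstract).
    """
    ranked = rank_traces(traces)
    pairs = []
    # combinations() preserves ranking order, so `chosen` is never less confident.
    for chosen, rejected in combinations(ranked, 2):
        if answer_confidence(chosen) - answer_confidence(rejected) > min_gap:
            pairs.append((chosen.text, rejected.text))
    return pairs


if __name__ == "__main__":
    samples = [
        Trace("trace A", [-0.1, -0.1]),
        Trace("trace B", [-2.0]),
        Trace("trace C", [-0.5, -0.7]),
    ]
    for chosen, rejected in preference_pairs(samples, min_gap=0.5):
        print(f"chosen={chosen!r} rejected={rejected!r}")
```

Pairs in this (chosen, rejected) form are the kind of input a standard preference-optimization trainer would consume. Which optimizer is used and how pairs are selected are left open here.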
