Maximizing Confidence Alone Improves Reasoning

Abstract

Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization, a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the entropy of the model's underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence in its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and across models of varying sizes from the Qwen and Mistral families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is limited or unavailable.
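
To make the idea of an entropy-based intrinsic reward concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the reward is the negative mean entropy of the model's next-token distributions over a set of generated token positions; the abstract does not specify which tokens are used or how entropies are aggregated, so both choices, along with the function names `token_entropy` and `confidence_reward`, are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at one position."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_reward(per_token_logits):
    """Hypothetical intrinsic reward: negative mean entropy over token positions.

    Assumption: entropies are averaged uniformly over the supplied positions.
    Lower entropy (higher confidence) yields a higher reward.
    """
    entropies = [token_entropy(step) for step in per_token_logits]
    return -sum(entropies) / len(entropies)


# Toy logits over a 3-token vocabulary at two positions.
confident = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]]   # sharply peaked distributions
uncertain = [[1.0, 1.0, 1.0], [0.5, 0.4, 0.6]]   # nearly uniform distributions

print(confidence_reward(confident) > confidence_reward(uncertain))
```

In an RL loop, a scalar like this would stand in for the external reward, so that sampled responses with more peaked distributions are reinforced. No ground-truth answer enters the computation, which is what makes the method unsupervised.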
