Learning to Reason at the Frontier of Learnability

Abstract

Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts - meaning they are already learned - or by none - providing no meaningful training signal. To address this, we adapt a method from the reinforcement learning literature - sampling for learnability - and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning with LLMs.
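
As an illustration of the idea of prioritising questions with high variance of success, the following minimal sketch scores each question by the variance of a Bernoulli outcome, p(1 - p), where p is its empirical success rate. It is not the authors' implementation: the function names, the example success rates, and the use of a simple top-k ranking (rather than a sampling scheme) are assumptions made for illustration.

```python
def learnability(success_rate: float) -> float:
    """Variance of a Bernoulli outcome with the given success probability.

    The score is zero when a question is always or never solved,
    and largest (0.25) when it is solved half of the time.
    """
    return success_rate * (1.0 - success_rate)


def select_questions(success_rates: dict[str, float], k: int) -> list[str]:
    """Return the k question ids with the highest learnability score.

    Hypothetical helper: a deterministic top-k ranking stands in for
    whatever sampling procedure a real training pipeline would use.
    """
    ranked = sorted(
        success_rates,
        key=lambda q: learnability(success_rates[q]),
        reverse=True,
    )
    return ranked[:k]


# Made-up success rates, e.g. estimated from several attempts per question.
rates = {"q_always": 1.0, "q_never": 0.0, "q_half": 0.5, "q_rare": 0.1}
selected = select_questions(rates, 2)
print(selected)
```

Under this scoring, questions solved by all attempts or by none receive a score of zero, so they are the last to be selected, which matches the observation that such questions carry no meaningful training signal.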
