Abstract
Large reasoning models (LRMs) achieve higher performance on challenging reasoning tasks by generating more tokens at inference time, but this verbosity often wastes computation on easy problems. Existing solutions, including supervised finetuning on shorter traces, user-controlled budgets, and RL with uniform penalties, require data curation or manual configuration, or treat all problems alike regardless of difficulty. We introduce Adaptive Length Penalty (ALP), a reinforcement learning objective that tailors generation length to the per-prompt solve rate. During training, ALP monitors each prompt's online solve rate through multiple rollouts and adds a differentiable penalty whose magnitude scales with that rate, so confident (easy) prompts incur a high cost for extra tokens while hard prompts remain unhindered. Post-training DeepScaleR-1.5B with ALP cuts average token usage by 50\% without significantly dropping performance. Relative to fixed-budget and uniform-penalty baselines, ALP redistributes its reduced budget more intelligently, cutting compute on easy prompts and reallocating the saved tokens to difficult ones, which delivers higher accuracy on the hardest problems at a higher token cost on those problems.
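As a rough illustration of the mechanism described above, the following Python sketch computes length-penalized rewards for a group of rollouts on a single prompt. It is not the paper's implementation: the function name `alp_rewards`, the parameters `beta` and `max_len`, and the specific form of the penalty (solve rate times normalized length) are assumptions chosen only so that the penalty is large on prompts the model usually solves and vanishes on prompts it never solves.

```python
from typing import List


def alp_rewards(
    correct: List[bool],
    lengths: List[int],
    max_len: int,
    beta: float = 0.1,
) -> List[float]:
    """Hypothetical per-rollout rewards for one prompt.

    correct[i] -- whether rollout i solved the prompt
    lengths[i] -- number of tokens generated by rollout i
    max_len    -- assumed generation cap, used to normalize length
    beta       -- assumed penalty weight (not taken from the paper)
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("need one length per rollout and at least one rollout")
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    # Online solve rate: fraction of this prompt's rollouts that were correct.
    solve_rate = sum(correct) / len(correct)

    rewards = []
    for ok, n_tokens in zip(correct, lengths):
        length_frac = min(n_tokens, max_len) / max_len
        # Assumed penalty form: grows with both solve rate and length,
        # so easy prompts pay more per token and unsolved prompts pay nothing.
        penalty = beta * solve_rate * length_frac
        rewards.append(float(ok) - penalty)
    return rewards


if __name__ == "__main__":
    # Easy prompt: every rollout is correct, so longer answers are penalized.
    print(alp_rewards([True, True], [100, 400], max_len=1000, beta=0.5))
    # Hard prompt: no rollout is correct, so length carries no penalty.
    print(alp_rewards([False, False], [900, 1000], max_len=1000, beta=0.5))
```

In this sketch the penalty is applied to the scalar reward of each rollout, which a policy-gradient method would then optimize; how the actual objective normalizes length and weights the penalty is not specified in the abstract.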