Abstract
Recent deep-thinking large language models often reason extensively to improve performance, but such lengthy reasoning is not always desirable, as it incurs excessive inference costs with disproportionate performance gains. Controlling reasoning length without sacrificing performance is therefore important, but remains challenging, especially under tight thinking budgets. We propose budget guidance, a simple yet effective method for steering the reasoning process of LLMs toward a target budget without requiring any LLM fine-tuning. Our approach introduces a lightweight predictor that models a Gamma distribution over the remaining thinking length during next-token generation. This signal is then used to guide generation in a soft, token-level manner, ensuring that the overall reasoning trace adheres to the specified thinking budget. Budget guidance enables natural control of the thinking length, along with significant token efficiency improvements over baseline methods on challenging math benchmarks. For instance, it achieves up to a 26% accuracy gain on the MATH-500 benchmark under tight budgets compared to baseline methods, while maintaining competitive accuracy with only 63% of the thinking tokens used by the full-thinking model. Budget guidance also generalizes to broader task domains and exhibits emergent capabilities, such as estimating question difficulty. The source code is available at: https://github.com/UMass-Embodied-AGI/BudgetGuidance.
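
To make the idea of soft, token-level guidance concrete, the following is a minimal sketch rather than the released implementation. It assumes one plausible reweighting rule: each candidate token's probability is multiplied by the Gamma-CDF probability that the remaining thinking length fits within the remaining budget, and the result is renormalized. The function names `gamma_cdf` and `guide`, the per-token `(shape, scale)` parameters, and the two illustrative tokens are all hypothetical and are not taken from the paper or its repository.

```python
import math


def gamma_cdf(x, shape, scale):
    """CDF of a Gamma(shape, scale) distribution at x.

    Uses the power series of the regularized lower incomplete gamma
    function; adequate for the moderate x / scale values in this sketch.
    """
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-12 * total and n < 10000:
        term *= z / (shape + n)
        total += term
        n += 1
    log_value = shape * math.log(z) - z - math.lgamma(shape) + math.log(total)
    return min(1.0, math.exp(log_value))


def guide(probs, gamma_params, remaining_budget):
    """Reweight next-token probabilities toward tokens likely to finish in budget.

    probs: token -> LLM probability (hypothetical toy vocabulary).
    gamma_params: token -> (shape, scale) of the predicted remaining length
        if that token is chosen (assumed to come from a learned predictor).
    remaining_budget: thinking tokens still allowed.
    """
    weighted = {
        tok: p * gamma_cdf(remaining_budget, *gamma_params[tok])
        for tok, p in probs.items()
    }
    norm = sum(weighted.values())
    if norm == 0.0:
        # No candidate is predicted to fit; fall back to the unguided distribution.
        return dict(probs)
    return {tok: w / norm for tok, w in weighted.items()}


# Toy example: "continue" predicts a long remaining trace (mean 200 tokens),
# "wrap_up" a short one (mean 20 tokens). All numbers are made up.
probs = {"continue": 0.7, "wrap_up": 0.3}
gamma_params = {"continue": (4.0, 50.0), "wrap_up": (2.0, 10.0)}

tight = guide(probs, gamma_params, remaining_budget=60)
loose = guide(probs, gamma_params, remaining_budget=2000)
```

Under the tight budget the guided distribution shifts most of its mass to `wrap_up`, while under the loose budget it stays essentially equal to the original probabilities. This matches the intent of soft guidance: the model is steered only when the budget is actually binding.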