Abstract
Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
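
As a rough illustration of hierarchical budget exploration, the sketch below partitions the rollouts for one prompt into subgroups with distinct token budgets and scores each rollout with a budget-aware reward. The abstract does not give the exact reward form, so everything concrete here is an assumption made for illustration: the function names `assign_budgets` and `budget_aware_reward`, the budget values, the round-robin assignment, and the linear overshoot penalty. It should not be read as the authors' implementation.

```python
from typing import List


def assign_budgets(num_rollouts: int, budgets: List[int]) -> List[int]:
    """Assign each rollout a token budget, cycling through the subgroups.

    Round-robin assignment is an assumption; any partition into
    subgroups with distinct budgets would fit the description.
    """
    return [budgets[i % len(budgets)] for i in range(num_rollouts)]


def budget_aware_reward(correct: bool, length: int, budget: int,
                        penalty: float = 0.5) -> float:
    """Hypothetical reward: correctness minus a penalty for exceeding the budget.

    Incorrect answers receive 0. Correct answers receive 1 minus a
    penalty proportional to the relative overshoot, floored at 0.
    Staying within the budget costs nothing.
    """
    if not correct:
        return 0.0
    overshoot = max(0, length - budget) / budget
    return max(0.0, 1.0 - penalty * overshoot)


# Hypothetical budgets: short budgets favor concise answers, while long
# budgets keep long reasoning paths in the exploration space.
budgets = assign_budgets(8, [512, 1024, 2048, 4096])
rewards = [budget_aware_reward(True, 1500, b) for b in budgets]
```

Under these assumptions, the same correct 1,500-token response is penalized in the 512- and 1,024-token subgroups and receives full reward in the 2,048- and 4,096-token subgroups, which is one way length pressure can be applied without eliminating long reasoning from training.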