R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

  • 2025-10-21 13:49:36
  • Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai

Abstract

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
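The abstract names query composition but does not give the procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: single-horizon problems are chained into one query, with each later problem referring to the previous answer. The `Problem` class, the `{prev}` placeholder, the prompt wording, and the all-or-nothing scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    # Hypothetical container: question text may contain the placeholder "{prev}".
    question: str
    answer: int


def compose(problems):
    """Join problems into one multi-problem query (assumed format).

    Each "{prev}" placeholder is rewritten to point at the preceding
    problem, which makes later problems depend on earlier answers.
    """
    parts = []
    for i, p in enumerate(problems, start=1):
        text = p.question.replace("{prev}", f"the answer to Problem {i - 1}")
        parts.append(f"Problem {i}: {text}")
    parts.append("Solve the problems in order and report every answer.")
    return "\n".join(parts)


def all_correct(predicted, problems):
    """Assumed scoring rule: the composed query counts only if every answer matches."""
    return len(predicted) == len(problems) and all(
        a == p.answer for a, p in zip(predicted, problems)
    )


chain = [
    Problem("What is 3 + 4?", 7),
    Problem("Let x be {prev}. What is 2 * x?", 14),
]
query = compose(chain)
```

Under this sketch, lengthening `chain` extends the reasoning horizon without writing new problems, which is one way to read the "scalable, controllable, and low-cost" claim.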

 
