Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Abstract

The emergence of reasoning models and their integration into practical AI chatbots have led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex, multi-step thought process. Yet a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.
