Abstract
The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex, multi-step thought process. Yet a complete understanding of why these models hallucinate more than general-purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.