Evaluating LLMs' Inherent Multi-hop Reasoning Ability

Abstract

While Large Language Models (LLMs) excel in question-answering (QA) tasks, their ability to integrate multiple pieces of evidence through multi-step reasoning on multi-hop QA tasks remains underexplored. LLMs sometimes generate answers that rely on internal memory rather than on reasoning over the given context, which raises concerns about how well evaluations capture real reasoning abilities. The counterfactual QA task can separate internal memory from reasoning abilities, but focusing solely on final-QA performance without evaluating the multi-step reasoning process is insufficient for reporting LLMs' real reasoning abilities. Current Multi-hop QA (MHQA) benchmarks are factual and annotated on open-source corpora such as Wikipedia; although useful for evaluating multi-step reasoning, they are limited by potential data contamination in the LLM pre-training stage. To address this issue, we introduce the Inherent Reasoning Evaluation (IRE) method, a novel evaluation approach that jointly evaluates LLMs' chain-of-reasoning performance on the first knowledge-edited counterfactual multi-hop QA data, which is built by editing the original Wikipedia passages to reduce data contamination risks. IRE comprehensively assesses reasoning chains through sub-QA and final-QA evaluations. Our comparisons reveal significant performance gaps for several LLMs between Wikipedia-based benchmarks and IRE, indicating data contamination issues in existing benchmarks. We believe that the IRE benchmark will enhance and facilitate trustworthy LLM evaluations.
