When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

Abstract

Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations -- superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.
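
To make the failure mode concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes a hypothetical count-based "model" that predicts nationality from a surname, with confidence taken as the empirical conditional frequency. The data, the names `predict` and `flagged_as_hallucination`, and the 0.9 threshold are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (surname, nationality) pairs.
# "Kim" is spuriously but strongly tied to one label; "Smith" is ambiguous.
train = (
    [("Kim", "Korean")] * 95
    + [("Kim", "American")] * 5
    + [("Smith", "American")] * 50
    + [("Smith", "British")] * 50
)

# Count how often each nationality co-occurs with each surname.
counts = defaultdict(Counter)
for surname, nationality in train:
    counts[surname][nationality] += 1


def predict(surname):
    """Return the most frequent label for a surname and its empirical probability."""
    c = counts[surname]
    label, n = c.most_common(1)[0]
    return label, n / sum(c.values())


def flagged_as_hallucination(surname, threshold=0.9):
    """Confidence-based filter (assumed form): flag answers below the threshold."""
    _, confidence = predict(surname)
    return confidence < threshold


# A person named Kim who is actually American gets a confident wrong answer,
# and the confidence filter does not flag it.
label, confidence = predict("Kim")
print(label, confidence, flagged_as_hallucination("Kim"))
print(predict("Smith"), flagged_as_hallucination("Smith"))
```

In this sketch the filter catches only the genuinely uncertain case ("Smith"). The error driven by the surname correlation passes through because the model's confidence reflects the training statistics rather than the truth about the individual, which is the intuition the abstract describes.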
