(Im)possibility of Automated Hallucination Detection in Large Language Models

Abstract

Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to language generation by Kleinberg and Mullainathan, we investigate whether an algorithm, trained on examples drawn from an unknown target language $K$ (selected from a countable collection) and given access to an LLM, can reliably determine whether the LLM's outputs are correct or constitute hallucinations. First, we establish an equivalence between hallucination detection and the classical task of language identification. We prove that any hallucination detection method can be converted into a language identification method, and conversely, algorithms solving language identification can be adapted for hallucination detection. Given the inherent difficulty of language identification, this implies that hallucination detection is fundamentally impossible for most language collections if the detector is trained using only correct examples from the target language. Second, we show that the use of expert-labeled feedback, i.e., training the detector with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements), dramatically changes this conclusion. Under this enriched training regime, automated hallucination detection becomes possible for all countable language collections. These results highlight the essential role of expert-labeled examples in training hallucination detectors and provide theoretical support for feedback-based methods, such as reinforcement learning with human feedback (RLHF), which have proven critical for reliable LLM deployment.
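To make the contrast between the two training regimes concrete, here is a minimal toy sketch. It is not the paper's construction: it uses the standard "guess the first consistent language" learner on a small finite collection. The names `COLLECTION`, `first_consistent`, and `hallucinates` are hypothetical, strings are stood in for by integers, and the collection (multiples of 1, 2, 3, and 4) is chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A "language" is modelled as a membership predicate over integers
# (an illustrative stand-in for strings; an assumption of this sketch).
Language = Callable[[int], bool]

# Hypothetical toy collection: L_0 = all integers, L_1 = evens,
# L_2 = multiples of 3, L_3 = multiples of 4.
COLLECTION: List[Language] = [lambda x, d=d: x % d == 0 for d in (1, 2, 3, 4)]


def first_consistent(
    collection: Sequence[Language],
    labeled: Sequence[Tuple[int, bool]],
) -> Optional[int]:
    """Return the index of the first language that agrees with every
    labeled example (x, is_member), or None if no language agrees."""
    for index, language in enumerate(collection):
        if all(language(x) == is_member for x, is_member in labeled):
            return index
    return None


def hallucinates(guess: Language, llm_outputs: Sequence[int]) -> bool:
    """Flag a hallucination if any LLM output falls outside the guessed language."""
    return any(not guess(x) for x in llm_outputs)


if __name__ == "__main__":
    # Target language K = multiples of 4 (index 3 in COLLECTION).
    positives_only = [(4, True), (8, True), (12, True)]
    with_negatives = positives_only + [(1, False), (2, False), (3, False)]

    # Positive examples alone never rule out the over-general L_0.
    print(first_consistent(COLLECTION, positives_only))

    # Labeled negative examples eliminate L_0, L_1, and L_2 in turn.
    guess_index = first_consistent(COLLECTION, with_negatives)
    print(guess_index)

    # With the refined guess, the output 6 (not a multiple of 4) is flagged.
    print(hallucinates(COLLECTION[guess_index], [8, 6]))
```

In this toy setting, the learner trained only on correct examples stays on the over-general language containing every integer, so a detector built on that guess would accept 6 as correct. Adding labeled incorrect examples lets the same learner discard each over-general candidate. This shows only how this one naive learner behaves on a finite collection; the general impossibility and possibility results are the theorems stated in the abstract.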
