HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Abstract

Large language models (LLMs), such as ChatGPT, are prone to generating hallucinations, i.e., content that conflicts with the source or cannot be verified against factual knowledge. To understand what types of content LLMs are apt to hallucinate and to what extent, we introduce the Hallucination Evaluation for Large Language Models (HaluEval) benchmark, a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, i.e., sampling-then-filtering. In addition, we hire human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content on specific topics by fabricating unverifiable information (i.e., in about 11.4% of user queries). Moreover, existing LLMs face great challenges in recognizing hallucinations in text. However, our experiments also show that hallucination recognition can be improved by providing external knowledge or adding reasoning steps. Our benchmark can be accessed at https://github.com/RUCAIBox/HaluEval.
