WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries

Abstract

While hallucinations of large language models (LLMs) prevail as a major challenge, existing evaluation benchmarks on factuality do not cover the diverse domains of knowledge that the real-world users of LLMs seek information about. To bridge this gap, we introduce WildHallucinations, a benchmark that evaluates factuality. It does so by prompting LLMs to generate information about entities mined from user-chatbot conversations in the wild. These generations are then automatically fact-checked against a systematically curated knowledge source collected from web search. Notably, half of these real-world entities do not have associated Wikipedia pages. We evaluate 118,785 generations from 15 LLMs on 7,919 entities. We find that LLMs consistently hallucinate more on entities without Wikipedia pages and exhibit varying hallucination rates across different domains. Finally, given the same base models, adding a retrieval component only slightly reduces hallucinations but does not eliminate hallucinations.
