HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

Abstract

While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it still faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures the evolution of temporal knowledge in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG. Our code and data are available at: https://github.com/0russwest0/HoH.
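To make the idea of a token-level diff concrete, the following is a minimal sketch of how changed spans between two snapshots of the same passage might be located. It is not the authors' pipeline: the whitespace tokenization, the use of Python's standard `difflib.SequenceMatcher`, the helper name `token_diff`, and the example sentences are all assumptions made for illustration. In a benchmark of this kind, the changed spans would presumably then be passed to an LLM step that decides whether they reflect a real factual update.

```python
import difflib


def token_diff(old_text, new_text):
    """Return (tag, old_tokens, new_tokens) for each changed token span.

    Hypothetical helper: tokens are produced by whitespace splitting,
    which is a simplification chosen for this illustration.
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip spans that are identical in both snapshots.
        if tag != "equal":
            changes.append((tag, old_tokens[i1:i2], new_tokens[j1:j2]))
    return changes


# Invented example: two snapshots of the same sentence.
old = "The CEO of ExampleCorp is Alice Smith ."
new = "The CEO of ExampleCorp is Bob Jones ."
changes = token_diff(old, new)
print(changes)
```

Each returned tuple pairs the outdated tokens with their replacement, which is the raw material from which an outdated answer and a current answer to the same question could be derived.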
