Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

Abstract

Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on a specific part of the document RAG system and use synthetic data with incomplete ground truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that provides fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic update support to address potential data contamination. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks show that the gap between text and visual embedding models is narrowing, highlighting the need to build stronger document retrieval models. Our findings also reveal an over-confidence dilemma in current document RAG frameworks, which tend to provide an answer even without supporting evidence. We hope our fully open-source Double-Bench provides a rigorous foundation for future research on advanced document RAG systems. We plan to collect up-to-date corpora and release new benchmarks on an annual basis.
