Benchmarking Large Language Models in Retrieval-Augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. We then evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. These assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
