Abstract
Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability to utilize retrieved visual knowledge more effectively.