Abstract
Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring the diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. All resources are publicly available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.