Abstract
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefine fixed retrieval processes, which causes two issues: (1) non-adaptive retrieval queries and (2) overloaded retrieval queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge this dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions that require complex knowledge retrieval strategies variable in query, tool, and time: (1) questions with rapidly changing answers, (2) questions requiring multi-modal knowledge, and (3) multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate human problem-solving behavior by dynamically decomposing complex multimodal questions into sub-question chains with retrieval actions. Extensive experiments demonstrate the effectiveness of OmniSearch and provide direction for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch.