Abstract
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.