Abstract
Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) reasoning and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) we first curate publicly available multimodal datasets containing question-answer pairs; 2) then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval-Augmented Generation (RAG) information through a multi-turn paradigm; 3) next, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) finally, to enhance efficiency, we optionally compress the multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at https://github.com/VIS-MPU-Agent/MMAT-1M.
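To make the fourth stage concrete, the following is a minimal sketch of how a multi-turn RR dialogue might be folded into a one-turn ORR sample. It is not the released implementation: the `Turn` record, the role names (`user`, `assistant`, `tool`), the step labels, and the function `compress_to_one_turn` are all hypothetical and chosen only for illustration. The actual MMAT-1M schema may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    # Hypothetical record; role is assumed to be "user", "assistant", or "tool".
    role: str
    content: str


def compress_to_one_turn(turns: List[Turn]) -> List[Turn]:
    """Fold a multi-turn dialogue into one question and one merged response.

    Assumption: the first turn is the user question, and every later turn
    (rationale, tool result, reflection) is kept in order in the response.
    """
    if not turns or turns[0].role != "user":
        raise ValueError("dialogue must start with a user question")

    parts = []
    for turn in turns[1:]:
        # Label tool outputs so they stay distinguishable after merging.
        label = "Tool result" if turn.role == "tool" else "Step"
        parts.append(f"{label}: {turn.content}")

    return [turns[0], Turn("assistant", "\n".join(parts))]


# Illustrative multi-turn sample (made up, not taken from the dataset).
rr_dialogue = [
    Turn("user", "What landmark is shown in the image?"),
    Turn("assistant", "Rationale: I should look up the structure."),
    Turn("tool", "Search result: Eiffel Tower"),
    Turn("assistant", "Reflection: the result matches. Answer: Eiffel Tower"),
]
orr_dialogue = compress_to_one_turn(rr_dialogue)
```

Under these assumptions, the compressed sample keeps the full reasoning trace but needs only a single model response per question, which is the efficiency motivation stated for the ORR format.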