Abstract
The development of Multimodal Large Language Models (MLLMs) has seen significant advancements. However, the quantity and quality of multimodal instruction data have emerged as major bottlenecks in their progress. Manually creating multimodal instruction data is both time-consuming and inefficient, which makes it difficult to produce instructions of high complexity. Moreover, distilling instruction data from black-box commercial models (e.g., GPT-4o, GPT-4V) often results in simplistic instruction data, which caps performance at the level of these models. The challenge of curating diverse and complex instruction data remains substantial. We propose MMEvol, a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. This iterative approach breaks through data quality bottlenecks to generate a complex and diverse image-text instruction dataset, thereby empowering MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we use MMEvol to systematically broaden the diversity of instruction types, integrate reasoning steps to enhance cognitive capabilities, and extract detailed information from images to improve visual understanding and robustness. To comprehensively evaluate the effectiveness of our data, we train LLaVA-NeXT using the evolved data and conduct experiments across 13 vision-language tasks. Compared to the baseline trained on the seed data, our approach achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of these tasks.