Effective Training Data Synthesis for Improving MLLM Chart Understanding

Abstract

Being able to effectively read scientific plots, or chart understanding, is a central part of building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind, with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low-quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.
