MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Abstract

Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate the corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.
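To make the claim that code encodes all information needed to generate a figure concrete, here is a toy sketch. It is not the authors' pipeline: the helper names `triangle_to_tikz` and `tikz_to_vertices` are hypothetical, and the choice of TikZ as the target language is an assumption made for illustration. It shows that the exact coordinates and labels of a simple figure can be read back from its code, which a free-form caption would not guarantee.

```python
import re


def triangle_to_tikz(vertices):
    """Emit TikZ source for a labeled triangle.

    `vertices` maps a label to an (x, y) coordinate.
    Hypothetical helper for illustration only.
    """
    # Join the vertex coordinates into a closed path.
    path = " -- ".join(f"({x},{y})" for x, y in vertices.values())
    lines = [r"\begin{tikzpicture}", rf"  \draw {path} -- cycle;"]
    # One \node per vertex carries the label at its exact position.
    for name, (x, y) in vertices.items():
        lines.append(rf"  \node at ({x},{y}) {{${name}$}};")
    lines.append(r"\end{tikzpicture}")
    return "\n".join(lines)


def tikz_to_vertices(code):
    """Recover the labeled coordinates from the TikZ source above."""
    pattern = r"\\node at \((-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)\) \{\$(\w+)\$\};"
    return {name: (float(x), float(y)) for x, y, name in re.findall(pattern, code)}


# A right triangle with legs 4 and 3 (values chosen arbitrarily).
spec = {"A": (0, 0), "B": (4, 0), "C": (0, 3)}
code = triangle_to_tikz(spec)
print(code)

# The code round-trips to the original geometry without loss.
assert tikz_to_vertices(code) == spec
```

Rendering `code` with a LaTeX toolchain would yield the image half of an image-code pair; that step is omitted here to keep the sketch dependency-free.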
