MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Abstract

Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar textual reasoning to image inputs or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of math content by vision encoders, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which yields our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
