Abstract
Recent large language models (LLMs) have demonstrated substantial progress in reasoning capabilities. One example is DeepSeek-R1, which leverages rule-based reinforcement learning to significantly enhance logical reasoning. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework, Infi-MMR, that systematically unlocks the reasoning potential of MSLMs through a curriculum of three carefully structured phases, and we propose our multimodal reasoning model, Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini). Resources are available at https://huggingface.co/Reallm-Labs/Infi-MMR-3B.