Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Abstract

Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns, in which models exhibit self-correction through reflection, are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% $\rightarrow$ 73.4% on MathVista, 62.9% $\rightarrow$ 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
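
For readers unfamiliar with GRPO, the second stage relies on scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates only that group-relative normalization step as it is commonly described for GRPO; the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are illustrative assumptions and are not taken from the paper or its repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    (group standard deviation + eps). Population std is an assumption here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this normalization, correct answers in a mixed group receive positive advantages and incorrect ones negative, while a group whose answers all earn the same reward contributes no learning signal.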
