MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

Abstract

Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical large multimodal models (LMMs) hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
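
To make "verifiable rewards based on final answer correctness" concrete, the following is a minimal sketch of what such a reward could look like for multiple-choice medical QA. It is an illustration under stated assumptions, not the released MedVLThinker implementation: the `<answer>...</answer>` output format, the single-letter option labels, and the function names `extract_answer` and `verifiable_reward` are all hypothetical.

```python
import re

# Assumption: the model wraps its final choice in <answer>...</answer> tags
# and the choice is a single option letter, optionally in parentheses.
# This output format is hypothetical, chosen only for illustration.
ANSWER_RE = re.compile(r"<answer>\s*\(?([A-Za-z])\)?\s*</answer>")


def extract_answer(response):
    """Return the last tagged option letter in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def verifiable_reward(response, gold):
    """Binary reward: 1.0 if the extracted letter equals the gold label, else 0.0.

    Only the final answer is checked; the reasoning text is not scored.
    """
    pred = extract_answer(response)
    if pred is None:
        return 0.0
    return 1.0 if pred == gold.strip().upper() else 0.0
```

In this sketch a response with no parsable answer receives zero reward, so the policy is rewarded only when it commits to a checkable final answer.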
