Abstract
Recent advances in vision-language models (VLMs) have enabled broad progress in the general medical field. However, pathology remains a more challenging subdomain, with current pathology-specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. These shortcomings are largely attributable to the nature of current pathology datasets, which consist primarily of image-description pairs that lack the depth and structured diagnostic paradigms employed by real-world pathologists. In this study, we leverage pathology textbooks and real-world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples to incentivize reasoning; (3) reinforcement learning using Group Relative Policy Optimization (GRPO) and Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) strategies to refine multimodal reasoning quality. To further assess the alignment quality of our dataset, we propose Patho-CLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both Patho-CLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, visual question answering, and multiple-choice questions. Our project is available at the Patho-R1 repository: https://github.com/Wenchuan-Zhang/Patho-R1.
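
For readers unfamiliar with the third stage, the following is a minimal sketch of the core ideas usually associated with Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization: advantages normalized within a group of sampled responses, a filter that discards groups whose rewards are all identical, and a clipped surrogate with separate lower and upper bounds. It is not taken from the Patho-R1 code. The function names, the `eps` stabilizer, and the clipping values `eps_low` and `eps_high` are illustrative assumptions, and the rewards are made-up numbers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` is a hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic-sampling style filter: a group whose rewards are all equal
    gives zero advantage for every response, so it carries no gradient signal."""
    return max(rewards) != min(rewards)


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token surrogate with decoupled clipping bounds.

    `ratio` is the new-policy / old-policy probability ratio. The bounds
    `eps_low` and `eps_high` are illustrative values, not the paper's settings.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    group = [1.0, 0.0, 1.0, 0.0]  # made-up rewards for four sampled answers
    if keep_group(group):
        for adv in group_relative_advantages(group):
            print(round(adv, 3), round(clipped_term(1.5, adv), 3))
```

In this sketch the advantage needs no learned value function, since the group mean serves as the baseline; the separate upper bound lets positively rewarded tokens grow somewhat more than a symmetric clip would allow.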