Disentangling Reasoning and Knowledge in Medical Large Language Models

Abstract

Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
