Abstract
The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.