Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

Abstract

Recent advancements in reasoning-enhanced large language models (LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated significant progress. However, their application in professional medical contexts remains underexplored, particularly in evaluating the quality of their reasoning processes alongside final outputs. Here, we introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references derived from clinical case reports. Spanning 13 body systems and 10 specialties, it includes both common and rare diseases. To comprehensively evaluate LLM performance, we propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. To assess reasoning quality, we present the Reasoning Evaluator, a novel automated system that objectively scores free-text reasoning responses based on efficiency, factuality, and completeness using dynamic cross-referencing and evidence checks. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking. Our results show that current LLMs achieve over 85% accuracy in relatively simple diagnostic tasks when provided with sufficient examination results. However, performance declines in more complex tasks, such as examination recommendation and treatment planning. While reasoning outputs are generally reliable, with factuality scores exceeding 90%, critical reasoning steps are frequently missed. These findings underscore both the progress and limitations of clinical LLMs. Notably, open-source models like DeepSeek-R1 are narrowing the gap with proprietary systems, highlighting their potential to drive accessible and equitable advancements in healthcare.
