DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

Abstract

The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties and derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3-mini, o1, and DeepSeek-R1, achieve only 45.82%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.
