Evolving Diagnostic Agents in a Virtual Clinical Environment

Abstract

In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnosis process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
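
The multi-turn process described in the abstract (the agent requests an examination, the environment returns a result, and the agent eventually commits to a diagnosis) can be illustrated with a minimal sketch. Everything named here is hypothetical: `ToyEnvironment`, `toy_policy`, `run_episode`, and `exam_f1` are illustrative stand-ins, not the DiagGym or DiagAgent interfaces, and the set-based F1 is the standard definition, assumed rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a virtual clinical environment.

    A learned world model would generate results conditioned on the
    patient history; this toy version only looks them up in a dict.
    """
    results: dict

    def examine(self, history, exam):
        # The history argument is unused here; it is kept to mirror the
        # "conditioned on patient history" interface in the abstract.
        return self.results.get(exam, "not available")


def toy_policy(history):
    """Hypothetical fixed policy: order two exams, then diagnose."""
    for exam in ("CBC", "abdominal CT"):
        if not any(entry.startswith(exam + ":") for entry in history):
            return "examine", exam
    return "diagnose", "appendicitis"


def run_episode(env, policy, initial_history, max_turns=5):
    """Alternate between policy and environment until a diagnosis is made."""
    history = list(initial_history)
    ordered = []
    for _ in range(max_turns):
        action, value = policy(history)
        if action == "diagnose":
            return value, ordered
        ordered.append(value)
        history.append(f"{value}: {env.examine(history, value)}")
    return None, ordered  # turn budget exhausted without a diagnosis


def exam_f1(recommended, reference):
    """Set-based F1 between recommended and reference examinations."""
    rec, ref = set(recommended), set(reference)
    true_positives = len(rec & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(rec)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


env = ToyEnvironment(results={"CBC": "WBC elevated",
                              "abdominal CT": "inflamed appendix"})
diagnosis, ordered = run_episode(env, toy_policy, ["right lower quadrant pain"])
score = exam_f1(ordered, ["CBC", "abdominal CT", "urinalysis"])
```

In a reinforcement learning setup of the kind the abstract describes, the fixed `toy_policy` would be replaced by a trainable model, and quantities such as the final diagnosis and the examination overlap would feed into the outcome-based reward.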
