Abstract
In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained on electronic health records that emits examination outcomes conditioned on patient history and the recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnostic process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.