AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Abstract

Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes overrely on static medical question-answering benchmarks and fall short on the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel in benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce diagnostic accuracy in doctor agents. The code and data for this work are publicly available at https://AgentClinic.github.io.
