Answering real-world clinical questions using large language model based systems

Abstract

Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems to answer 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval-augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions, compared with the other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence, working synergistically, would improve the availability of pertinent evidence for patient care.
