Abstract
Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems to answer 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2%-10%). In contrast, retrieval-augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions (65% vs. 0-9% for the other LLMs). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence, working synergistically, would improve the availability of pertinent evidence for patient care.