Abstract
Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.