CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation

Abstract

Chest X-rays (CXRs) are the most frequently performed imaging test in clinical practice. Recent advances in the development of vision-language foundation models (FMs) give rise to the possibility of performing automated CXR interpretation, which can assist physicians with clinical decision-making and improve patient outcomes. However, developing FMs that can accurately interpret CXRs is challenging due to the (1) limited availability of large-scale vision-language datasets in the medical image domain, (2) lack of vision and language encoders that can capture the complexities of medical data, and (3) absence of evaluation frameworks for benchmarking the abilities of FMs on CXR interpretation. In this work, we address these challenges by first introducing CheXinstruct, a large-scale instruction-tuning dataset curated from 28 publicly available datasets. We then present CheXagent, an instruction-tuned FM capable of analyzing and summarizing CXRs. To build CheXagent, we design a clinical large language model (LLM) for parsing radiology reports, a vision encoder for representing CXR images, and a network to bridge the vision and language modalities. Finally, we introduce CheXbench, a novel benchmark designed to systematically evaluate FMs across 8 clinically relevant CXR interpretation tasks. Extensive quantitative evaluations and qualitative reviews with five expert radiologists demonstrate that CheXagent outperforms previously developed general- and medical-domain FMs on CheXbench tasks. Furthermore, in an effort to improve model transparency, we perform a fairness evaluation across factors of sex, race, and age to highlight potential performance disparities. Our project is at https://stanford-aimi.github.io/chexagent.html.
