Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis

  • 2025-04-22 18:20:34
  • Frank Li, Hari Trivedi, Bardia Khosravi, Theo Dapamede, Mohammadreza Chavoshi, Abdulhameed Dere, Rohan Satya Isaac, Aawez Mansuri, Janice Newsome, Saptarshi Purkayastha, Judy Gichoya

Abstract

Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.

 
