Abstract
Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.