Abstract
Medical image foundation models have shown the ability to segment organs and tumors with minimal fine-tuning. These models are typically evaluated on task-specific in-distribution (ID) datasets. However, reliable performance on ID datasets does not guarantee robust generalization to out-of-distribution (OOD) datasets. Importantly, once a model is deployed for clinical use, it is impractical to obtain 'ground truth' delineations to assess ongoing performance drift, especially when images fall into the OOD category due to different imaging protocols. Hence, we introduced a comprehensive set of computationally fast metrics to evaluate the performance of multiple foundation models (Swin UNETR, SimMIM, iBOT, SMIT) trained with self-supervised learning (SSL). SSL pretraining was selected because this approach is applicable to large, diverse, and unlabeled image sets. All models were fine-tuned on identical datasets for lung tumor segmentation from computed tomography (CT) scans. SimMIM, iBOT, and SMIT used identical architecture, pretraining, and fine-tuning datasets to assess performance variations with the choice of pretext tasks used in SSL. Evaluation was performed on two public lung cancer datasets (LRAD: n = 140, 5Rater: n = 21) with different image acquisitions and tumor stages compared to the training data (n = 317, a public resource with stage III-IV lung cancers) and on a public non-cancer dataset containing volumetric CT scans of patients with pulmonary embolism (n = 120). All models produced similarly accurate tumor segmentation on the lung cancer testing datasets. SMIT produced the highest F1-score (LRAD: 0.60, 5Rater: 0.64) and the lowest entropy (LRAD: 0.06, 5Rater: 0.12), indicating a higher tumor detection rate and more confident segmentations. In the OOD dataset, SMIT misdetected the fewest tumors, as indicated by a median volume occupancy of 5.67 cc compared to 9.97 cc for the second-best method, SimMIM.
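
Two of the label-free quantities mentioned above, segmentation entropy and volume occupancy, can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption: entropy is taken as the binary Shannon entropy of a per-voxel tumor probability, averaged over voxels predicted as tumor, and volume occupancy is taken as the predicted tumor volume in cc. The function names, the 0.5 threshold, and the choice of averaging region are hypothetical and are not taken from the paper.

```python
import numpy as np


def binary_entropy(prob, eps=1e-7):
    """Voxel-wise binary Shannon entropy (bits) of a tumor probability map."""
    # Clip to avoid log2(0); eps is an illustrative choice.
    p = np.clip(prob, eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))


def mean_foreground_entropy(prob, threshold=0.5):
    """Mean entropy over voxels predicted as tumor (assumed definition).

    Returns 0.0 when no voxel exceeds the threshold.
    """
    mask = prob >= threshold
    if not mask.any():
        return 0.0
    return float(binary_entropy(prob)[mask].mean())


def volume_occupancy_cc(prob, spacing_mm, threshold=0.5):
    """Predicted tumor volume in cubic centimetres (1 cc = 1000 mm^3)."""
    voxel_mm3 = float(np.prod(spacing_mm))
    n_voxels = float((prob >= threshold).sum())
    return n_voxels * voxel_mm3 / 1000.0


# Toy example: 100 voxels predicted as tumor with probability 0.9,
# on a grid with hypothetical 1 x 1 x 10 mm voxel spacing.
prob = np.zeros((10, 10, 10))
prob[:5, :5, :4] = 0.9
print(volume_occupancy_cc(prob, (1.0, 1.0, 10.0)))
print(mean_foreground_entropy(prob))
```

Neither quantity needs a reference delineation, which is what makes metrics of this kind usable after deployment. On a non-cancer dataset such as the pulmonary embolism scans, any predicted tumor volume counts as a false detection, so a lower volume occupancy is better.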