Abstract
Medical Vision-Language Pretraining (Med-VLP) establishes a connection between visual content from medical images and the relevant textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must effectively capture essential semantics from the significantly sparser representations in 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we develop an abnormality dictionary to augment contrastive learning with diverse contrastive pairs. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show that our model outperforms the standard CLIP framework in both zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
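To make the idea of contrastive learning over organ-level image-text pairs concrete, the following is a minimal sketch of a generic CLIP-style symmetric InfoNCE loss applied to a batch of matched (organ feature, text feature) pairs. It is an illustration under stated assumptions, not the CT-GLIP objective itself: the function names, the temperature of 0.07, and the use of pre-pooled per-organ feature vectors are all hypothetical, and the abnormality dictionary is not modeled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def organ_level_contrastive_loss(organ_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over N matched pairs (hypothetical helper).

    organ_feats: (N, D) array, one pooled visual feature per organ region.
    text_feats:  (N, D) array, one embedding per organ-level report sentence.
    Row i of each array is assumed to describe the same organ of the same scan.
    """
    v = l2_normalize(np.asarray(organ_feats, dtype=np.float64))
    t = l2_normalize(np.asarray(text_feats, dtype=np.float64))
    logits = v @ t.T / temperature            # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])          # positives lie on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    aligned = organ_level_contrastive_loss(feats, feats)
    shuffled = organ_level_contrastive_loss(feats, np.roll(feats, 1, axis=0))
    print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {shuffled:.4f}")
```

In this sketch, correctly matched pairs yield a lower loss than mismatched ones, which is the signal that pulls each organ's visual feature toward its own diagnostic text and away from the text of other organs in the batch.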