Abstract
Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports. In zero-shot and fine-tuned detection of 18 pathologies, as well as in image-text retrieval tasks, DCFormer consistently outperforms state-of-the-art 3D vision encoders, including CT-ViT, ViT, ConvNeXt, PoolFormer, and TransUNet. These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs. Our code is available at: https://github.com/mirthAI/DCFormer.
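To make the cost argument concrete, the following minimal sketch counts kernel weights for one input/output channel pair, comparing a full k x k x k kernel with three 1D kernels of length k (one each along depth, height, and width). It illustrates the scaling only and is not the DCFormer implementation; the function names are hypothetical, and details such as channel mixing, bias terms, and normalization are deliberately ignored.

```python
def full_3d_weights(k: int) -> int:
    """Weights in one k x k x k kernel (single channel pair, no bias)."""
    return k ** 3


def factorized_1d_weights(k: int) -> int:
    """Weights in three parallel 1D kernels of length k (depth, height, width)."""
    return 3 * k


# Hypothetical kernel sizes chosen for illustration only.
for k in (3, 7, 11):
    dense = full_3d_weights(k)
    split = factorized_1d_weights(k)
    print(f"k={k}: 3D={dense}, 3x1D={split}, ratio={dense / split:.1f}x")
```

Under these assumptions the weight count grows cubically for the full kernel but only linearly for the factorized form, so the gap widens as the kernel size increases. The multiply-accumulate count per output voxel follows the same ratio.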