Abstract
Recent advancements in Contrastive Language-Image Pre-training (CLIP) have demonstrated notable success in self-supervised representation learning across various tasks. However, existing CLIP-like approaches often demand extensive GPU resources and prolonged training times due to the considerable size of the model and dataset, making them poorly suited to medical applications, where large datasets are often unavailable. Meanwhile, the language model prompts are mainly derived manually from the labels tied to images, potentially overlooking the richness of information within the training samples. We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT) that harnesses the strengths of extensive pre-trained language and visual models. Furthermore, we present an efficient strategy for learning context-based prompts that narrows the gap between informative clinical diagnostic data and simple class labels. Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets compared with various baselines. The proposed parameter-efficient framework reduces the total trainable model size by 39% and reduces the trainable language model to only 4% compared with the current BERT encoder.