Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models

Abstract

Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.
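
For readers unfamiliar with the matching tasks named above, the following is a minimal sketch of a generic CLIP-style symmetric contrastive objective and of top-1 retrieval accuracy in both directions. It is not taken from the ALTA codebase: the function names, the temperature value of 0.07, and the use of NumPy are all assumptions made for illustration, and the embeddings are assumed to be precomputed, with row i of the image matrix paired with row i of the text matrix.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img, txt, temperature=0.07):
    # Hypothetical helper: generic CLIP-style loss, not the ALTA objective.
    # sim[i, j] is the scaled cosine similarity of image i and text j.
    sim = l2_normalize(img) @ l2_normalize(txt).T / temperature
    idx = np.arange(sim.shape[0])
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(sim, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def top1_retrieval_accuracy(img, txt):
    # Fraction of queries whose most similar candidate is the paired one.
    sim = l2_normalize(img) @ l2_normalize(txt).T
    idx = np.arange(sim.shape[0])
    i2t = float((sim.argmax(axis=1) == idx).mean())  # image-to-text
    t2i = float((sim.argmax(axis=0) == idx).mean())  # text-to-image
    return i2t, t2i
```

Under these assumptions, a lower loss corresponds to paired radiograph and report embeddings being closer to each other than to the other items in the batch, which is the property that retrieval and zero-shot classification rely on.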
