Alternating Cross-attention Vision-Language Model for Efficient Learning with Medical Image and Report without Curation

Abstract

Recent advances in vision-language pre-training have demonstrated astounding performance in diverse vision-language tasks, shedding light on the long-standing problem of comprehensively understanding both visual and textual concepts in artificial intelligence research. However, there has been limited success in applying vision-language pre-training to the medical domain: current vision-language models and learning strategies designed for photographic images and captions are not optimal for medical data, which are usually insufficient in amount and diversity, and this impedes successful learning of joint vision-language concepts. In this study, we introduce MAX-VL, a model tailored for efficient vision-language pre-training in the medical domain. We experimentally demonstrated that the pre-trained MAX-VL model outperforms current state-of-the-art vision-language models in various vision-language tasks. We also suggested its clinical utility for the diagnosis of newly emerging diseases and for human error detection, and showed the widespread applicability of the model to data from different domains.
