Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Abstract

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multimodal tasks, such as image captioning and visual question answering, by extending the self-attention based Transformer architecture with multimodal pre-training objectives. Despite its huge potential, vision-language multimodal pre-training in the medical domain has only recently received attention, and has so far only been shown to improve the diagnosis accuracy of vision-language pre-trained models. In this work, we explore a broad set of multimodal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose a new model that adopts a Transformer-based architecture combined with a novel multimodal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (e.g., diagnosis classification) and vision-language generation tasks (e.g., radiology report generation). By rigorously evaluating the proposed model on four downstream tasks with three radiographic image-text datasets (MIMIC-CXR, Open-I, and VQA-RAD), we empirically demonstrate the superior downstream task performance and generality of our model against various baselines, including task-specific architectures. In addition, we qualitatively analyze our model by showing the results of retrieved image-report pairs, the attention map visualization, and generated reports. With its novel self-attention scheme, our proposed multimodal pre-training model can flexibly adapt to multiple downstream tasks of vision-language understanding and generation. We believe that our approach can provide the basis for a wide range of interpretations of vision-language multimodal data in the medical domain.
