Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Abstract

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a BERT-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (diagnosis classification, medical image-report retrieval, medical visual question answering) and a vision-language generation task (radiology report generation). By statistically and rigorously evaluating the proposed model on four downstream tasks with three radiographic image-report datasets (MIMIC-CXR, Open-I, and VQA-RAD), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures. The source code is publicly available at: https://github.com/SuperSupermoon/MedViLL
