Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

Abstract

Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods contain three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks, with few studies considering the importance of medical domain expert knowledge and explicitly exploiting such knowledge to facilitate Med-VLP. Although there exist knowledge-enhanced vision-and-language pre-training (VLP) methods in the general domain, most require off-the-shelf toolkits (e.g., object detectors and scene graph parsers), which are unavailable in the medical domain. In this paper, we propose a systematic and effective approach to enhance Med-VLP with structured medical knowledge from three perspectives. First, considering that knowledge can be regarded as the intermediate medium between vision and language, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as a supplement to the input image and text. Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks. To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results illustrate the effectiveness of our approach, where state-of-the-art performance is achieved on all downstream tasks. Further analyses explore the effects of different components of our approach and various settings of pre-training.
