Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training

Abstract

Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. Three key designs make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction, to deal with the different levels of abstraction in vision and language. Third, we develop different designs for the vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, with state-of-the-art results achieved on all downstream tasks. We also conduct further analysis to verify the effectiveness of the different components of our approach and of various pre-training settings. The source code is available at https://github.com/zhjohnchan/M3AE.
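
The first design choice, masking images more aggressively than text, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper `random_mask`, the ratio values (0.75 for images, 0.15 for text), the patch count, and the token count are all assumptions chosen only to show the asymmetry; the abstract itself does not state specific values.

```python
import random


def random_mask(num_items, mask_ratio, rng):
    """Return sorted indices of items to mask, drawn uniformly without replacement."""
    num_masked = int(round(num_items * mask_ratio))
    return sorted(rng.sample(range(num_items), num_masked))


# Hypothetical settings for illustration only (not taken from the abstract).
IMAGE_MASK_RATIO = 0.75  # images: a considerably larger masking ratio
TEXT_MASK_RATIO = 0.15   # text: a smaller masking ratio

rng = random.Random(0)   # fixed seed so the sketch is reproducible
num_patches = 196        # assumed: a 14 x 14 grid of image patches
num_tokens = 40          # assumed: length of a tokenized report

masked_patches = random_mask(num_patches, IMAGE_MASK_RATIO, rng)
masked_tokens = random_mask(num_tokens, TEXT_MASK_RATIO, rng)

# The model would then be trained to reconstruct the pixels of the masked
# patches and the identities of the masked tokens from what remains visible.
print(len(masked_patches), len(masked_tokens))  # prints "147 6"
```

With these assumed values, roughly three quarters of the image patches are hidden while only a handful of text tokens are, which reflects the lower information density of images relative to language.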
