MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report

Abstract

In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We utilize LoRA-PEFT to significantly reduce the number of trainable parameters in the LLM and incorporate a recent linear attention dropping strategy in the Vision Transformer (ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and radiology/cardiology reports with this approach. By utilizing contrastive loss, MoRE effectively aligns modality-specific features into a coherent embedding, which supports various downstream tasks such as zero-shot classification and multimodal retrieval. Employing our proposed methodology, we achieve state-of-the-art (SOTA) results on the MIMIC-IV, CheXpert, Edema Severity, and PTB-XL downstream datasets, surpassing existing multimodal approaches. Our proposed framework shows significant improvements in capturing intricate inter-modal relationships and in robustness for medical diagnosis, establishing a foundation for future research in multimodal learning in the healthcare sector.
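
The contrastive alignment mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the code below assumes a standard symmetric InfoNCE (CLIP-style) objective over paired embeddings. The function names, the temperature value of 0.07, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i] within a batch.

    Every other positive in the batch acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the row of similarities.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Average of the a->b and b->a InfoNCE terms (an assumed, CLIP-style form)."""
    return 0.5 * (
        info_nce(emb_a, emb_b, temperature) + info_nce(emb_b, emb_a, temperature)
    )


# Toy embeddings standing in for encoder outputs of two modalities
# (for example, X-ray and report); row i of each list is the same patient.
xray_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
report_emb = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
mismatched_report_emb = [report_emb[1], report_emb[2], report_emb[0]]

loss_matched = symmetric_contrastive_loss(xray_emb, report_emb)
loss_mismatched = symmetric_contrastive_loss(xray_emb, mismatched_report_emb)
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the property that makes the shared space usable for retrieval. One plausible extension to three modalities is to sum the pairwise terms (X-ray/report, ECG/report), though how MoRE combines them is not stated in the abstract.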
