Abstract
In the field of medical Vision-Language Pre-training (VLP), significant efforts have been devoted to deriving text and image features from both clinical reports and associated medical images. However, most existing methods may have overlooked the opportunity to leverage the inherent hierarchical structure of clinical reports, which are generally split into `findings' for descriptive content and `impressions' for conclusive observations. Instead of utilizing this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge in formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports for vision-language alignment.