TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model

Abstract

The vision-language modeling capability of multi-modal large language models has attracted wide attention from the community. However, in the medical domain, radiology report generation using vision-language models still faces significant challenges: the data distribution is imbalanced because radiology reports contain numerous negated descriptions, and the alignment between radiology reports and radiographs is coarse. In this paper, we propose a truthful radiology report generation framework, namely TRRG, based on stage-wise training for cross-modal disease clue injection into large language models. In the pre-training stage, contrastive learning is employed to enhance the ability of the visual encoder to perceive fine-grained disease details. In the fine-tuning stage, our proposed clue injection module significantly enhances the disease-oriented perception capability of the large language model by incorporating robust zero-shot disease perception. Finally, through the cross-modal clue interaction module, our model achieves multi-granular interaction between visual embeddings and an arbitrary number of disease clue embeddings. This significantly enhances the report generation capability and clinical effectiveness of multi-modal large language models in radiology report generation. Experimental results demonstrate that our proposed pre-training and fine-tuning framework achieves state-of-the-art performance in radiology report generation on datasets such as IU-Xray and MIMIC-CXR. Further analysis indicates that our proposed method improves the model's ability to perceive diseases and its clinical effectiveness.
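The abstract states that contrastive learning is used in the pre-training stage but does not give the exact objective. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over paired image and report embeddings. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(row, idx):
    """Return log-softmax of `row` evaluated at position `idx` (numerically stable)."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_sum_exp


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: the i-th image and i-th report form the positive pair.

    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Temperature-scaled cosine similarity between every image and every report.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n

    # Text-to-image direction: same computation over the columns.
    columns = [[sim[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax_at(columns[k], k) for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy example: two perfectly aligned pairs versus the same pairs mismatched.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the aligned pairs yield a much lower loss than the mismatched ones, which is the signal that pushes an encoder to place each image near its own report. The actual TRRG objective may differ, for example by contrasting at a finer, disease-level granularity.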
