Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2

Abstract

Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and error-prone, creating a significant bottleneck in clinical workflows. Despite advances in AI-generated radiology reports, generating detailed and accurate reports remains challenging. In this study, we evaluate different combinations of multimodal models that integrate computer vision and natural language processing to generate comprehensive radiology reports. We employ a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as image encoders, with BART and GPT-2 serving as text decoders. Using chest X-ray images and reports from the IU-Xray dataset, we assess the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART, and ViT-B16-GPT-2 models for report generation, with the aim of identifying the best combination. The SWIN Transformer-BART model performed best among the four, achieving remarkable results on almost all evaluation metrics, including ROUGE, BLEU, and BERTScore.
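
The four encoder-decoder pairings compared in this study, and the kind of overlap score used to rank them, can be illustrated with a minimal sketch. This is not the authors' code: the names `ENCODERS`, `DECODERS`, `model_pairings`, and `rouge1_f1` are hypothetical, the strings are labels rather than checkpoint identifiers, and `rouge1_f1` is a simplified unigram-overlap F1 on whitespace tokens rather than the official ROUGE implementation. The example sentences are invented for illustration and are not taken from IU-Xray.

```python
from collections import Counter
from itertools import product

# Encoder and decoder families named in the abstract (labels only, not checkpoint IDs).
ENCODERS = ["SWIN Transformer", "ViT-B16"]
DECODERS = ["BART", "GPT-2"]


def model_pairings():
    """Return every encoder-decoder combination as an 'Encoder-Decoder' label."""
    return [f"{enc}-{dec}" for enc, dec in product(ENCODERS, DECODERS)]


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(model_pairings())
    # Toy sentences, invented for illustration only.
    reference = "no acute cardiopulmonary abnormality"
    candidate = "no acute abnormality"
    print(round(rouge1_f1(reference, candidate), 3))  # 0.857
```

In a real comparison, each pairing would generate a report per test image, and scores such as this one would be averaged over the test set alongside BLEU and BERTScore.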
