Abstract
We present a radiology-specific multimodal model for the task of generating radiological reports from chest X-rays (CXRs). Our work builds on the idea that large language models can be equipped with multimodal capabilities through alignment with pre-trained vision encoders. On natural images, this has been shown to allow multimodal models to gain image understanding and description capabilities. Our proposed model (MAIRA-1) leverages a CXR-specific image encoder in conjunction with a fine-tuned large language model based on Vicuna-7B, and text-based data augmentation, to produce reports with state-of-the-art quality. In particular, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and across all lexical metrics considered. Manual review of model outputs demonstrates promising fluency and accuracy of generated reports while uncovering failure modes not captured by existing evaluation practices. More information and resources can be found on the project website: https://aka.ms/maira.