Insights into a radiology-specialised multimodal large language model with sparse autoencoders

  • 2025-07-18 09:19:19
  • Kenza Bouzid, Shruthi Bannur, Felix Meissen, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland

Abstract

Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts - including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2 - marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: https://huggingface.co/microsoft/maira-2-sae.
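
For readers unfamiliar with the terms in the abstract, the following is a minimal, illustrative sketch of a sparse autoencoder over a single activation vector, a nested-prefix reconstruction loss in the spirit of Matryoshka-style SAEs, and feature steering by adding a scaled decoder direction. It is not the paper's implementation and does not load MAIRA-2 or the released SAEs: the class `TinySAE`, the functions `matryoshka_loss` and `steer`, the dimensions, the random weights and the prefix sizes are all hypothetical choices made for illustration, and the ReLU encoder and additive steering rule are assumed common formulations rather than details confirmed by the abstract.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class TinySAE:
    """Toy SAE (hypothetical): f = ReLU(W_enc (x - b_dec) + b_enc), x_hat = sum_i f_i * d_i + b_dec."""

    def __init__(self, d_model, d_sae, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for trained parameters (assumption for illustration).
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        # Row i of w_dec is the decoder direction d_i of feature i.
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_sae)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations."""
        centred = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [sum(w * c for w, c in zip(row, centred)) + b
               for row, b in zip(self.w_enc, self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, f, n_features=None):
        """Reconstruct from the first n_features features (all of them if None)."""
        k = len(f) if n_features is None else n_features
        out = list(self.b_dec)
        for fi, row in zip(f[:k], self.w_dec[:k]):
            for j, w in enumerate(row):
                out[j] += fi * w
        return out


def matryoshka_loss(sae, x, prefix_sizes):
    """Sum of reconstruction errors over nested feature prefixes.

    Assumed formulation: each prefix of the dictionary must reconstruct x on its own.
    """
    f = sae.encode(x)
    return sum(mse(x, sae.decode(f, k)) for k in prefix_sizes)


def steer(x, sae, feature_idx, alpha):
    """Shift an activation along one feature's decoder direction by strength alpha."""
    direction = sae.w_dec[feature_idx]
    return [xi + alpha * di for xi, di in zip(x, direction)]


if __name__ == "__main__":
    sae = TinySAE(d_model=4, d_sae=6)
    x = [1.0, -0.5, 0.25, 2.0]  # stand-in for one model activation vector
    print("features:", sae.encode(x))
    print("nested loss:", matryoshka_loss(sae, x, prefix_sizes=[2, 4, 6]))
    print("steered:", steer(x, sae, feature_idx=2, alpha=3.0))
```

In an actual steering experiment, the shifted activation would be written back into the model's forward pass at the layer the SAE was trained on; that integration step is omitted here because it depends on the model code.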
