Abstract
Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) models. However, challenges remain at each stage of model development: the scarcity of expert-labeled images hinders data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B-parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline that spans synthetic question-answer pair generation through multi-stage fine-tuning on specialised, radiology-targeted datasets (e.g., ROCO v2.0, MedPix v2.0). Our results show that, despite operating at a fraction of the scale of state-of-the-art models such as LLaVA-Med, our model achieves promising performance given its small parameter count and the limited scale of its training data. We also introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes.