How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

  • 2025-07-18 16:56:02
  • Che Liu, Jiazhen Pan, Weixiang Shen, Wenjia Bai, Daniel Rueckert, Rossella Arcucci

Abstract

Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.

 
