Abstract
This paper presents a comprehensive evaluation of GPT-4V's capabilities across diverse medical imaging tasks, including Radiology Report Generation, Medical Visual Question Answering (VQA), and Visual Grounding. While prior efforts have explored GPT-4V's performance in medical image analysis, to the best of our knowledge, our study represents the first quantitative evaluation on publicly available benchmarks. Our findings highlight GPT-4V's potential in generating descriptive reports for chest X-ray images, particularly when guided by well-structured prompts. Meanwhile, its performance on the MIMIC-CXR benchmark reveals areas for improvement on certain evaluation metrics, such as CIDEr. In the domain of Medical VQA, GPT-4V demonstrates proficiency in distinguishing between question types but falls short of the VQA-RAD benchmark in terms of accuracy. Furthermore, our analysis reveals the limitations of conventional evaluation metrics such as BLEU scores, and we advocate for the development of more semantically robust assessment methods. In the field of Visual Grounding, GPT-4V exhibits preliminary promise in recognizing bounding boxes, but its precision is lacking, especially in identifying specific medical organs and signs. Our evaluation underscores the significant potential of GPT-4V in the medical imaging domain, while also emphasizing the need for targeted refinements to fully unlock its capabilities.