Targeted Visual Prompting for Medical Visual Question Answering

Abstract

With growing interest in recent years, medical visual question answering (Med-VQA) has rapidly evolved, with multimodal large language models (MLLMs) emerging as an alternative to classical model architectures. Specifically, their ability to add visual information to the input of pre-trained LLMs brings new capabilities for image interpretation. However, simple visual errors cast doubt on the actual visual understanding abilities of these models. To address this, region-based questions have been proposed as a means to assess and enhance actual visual understanding through compositional evaluation. To combine these two perspectives, this paper introduces targeted visual prompting to equip MLLMs with region-based questioning capabilities. By presenting the model with both the isolated region and the region in its context in a customized visual prompt, we show the effectiveness of our method across multiple datasets while comparing it to several baseline models. Our code and data are available at https://github.com/sergiotasconmorales/locvqallm.
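To make the idea of pairing the isolated region with the region in its context concrete, the following is a minimal sketch and not the released implementation. It assumes a grayscale image stored as nested lists, a hypothetical `(top, left, height, width)` box convention, and that the region is marked in context by drawing its border; the abstract does not specify any of these details, and the function names are illustrative only.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, height, width); hypothetical convention


def crop_region(image: Image, box: Box) -> Image:
    """Return the isolated region as a new image."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def outline_region(image: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the full image with the region's border drawn on it."""
    top, left, height, width = box
    out = [row[:] for row in image]  # copy so the input image is left untouched
    bottom, right = top + height - 1, left + width - 1
    for c in range(left, right + 1):  # top and bottom edges
        out[top][c] = value
        out[bottom][c] = value
    for r in range(top, bottom + 1):  # left and right edges
        out[r][left] = value
        out[r][right] = value
    return out


def build_visual_prompt(image: Image, box: Box, question: str) -> Dict[str, object]:
    """Bundle both views of the region with the question (assumed layout)."""
    return {
        "context": outline_region(image, box),  # region shown within the full image
        "region": crop_region(image, box),      # region shown in isolation
        "question": question,
    }


if __name__ == "__main__":
    img = [[r * 4 + c for c in range(4)] for r in range(4)]
    prompt = build_visual_prompt(img, (0, 0, 3, 3), "Is there an abnormality in this region?")
    print(prompt["region"])
```

In an actual MLLM pipeline, the two images would be encoded by the vision encoder and interleaved with the text tokens; that step depends on the specific model and is omitted here.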
