Abstract
Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, treat it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.
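The mapping of visual features to prompt tokens can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the architecture used in this work: the single linear map, the dimensions, the random initialisation, the placement of the visual tokens before the question, and the names `make_mapper`, `map_to_prefix` and `build_prompt` are all hypothetical. A real system would learn the weights and pass the resulting sequence to a pre-trained language model.

```python
import random

def make_mapper(feat_dim, prefix_len, embed_dim, seed=0):
    """Hypothetical mapper: one weight matrix with
    prefix_len * embed_dim rows and feat_dim columns."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(feat_dim)]
            for _ in range(prefix_len * embed_dim)]

def map_to_prefix(weights, visual_feat, prefix_len, embed_dim):
    """Project one visual feature vector to prefix_len token embeddings."""
    flat = [sum(w * x for w, x in zip(row, visual_feat)) for row in weights]
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(prefix_len)]

def build_prompt(prefix_tokens, question_embeds):
    """Assumed ordering: visual prefix tokens first, then the question."""
    return prefix_tokens + question_embeds

# Toy dimensions chosen for illustration only.
feat_dim, prefix_len, embed_dim = 8, 4, 6
weights = make_mapper(feat_dim, prefix_len, embed_dim)
visual_feat = [0.1] * feat_dim                           # stand-in for encoder output
question_embeds = [[0.0] * embed_dim for _ in range(3)]  # stand-in for 3 question tokens

prefix = map_to_prefix(weights, visual_feat, prefix_len, embed_dim)
prompt = build_prompt(prefix, question_embeds)
```

In this sketch the prompt is a sequence of embedding vectors rather than text, so the language model would consume it through its embedding-level input rather than through its tokenizer.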