Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

Abstract

Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed set of curated answers. We focus on open-ended VQA and, motivated by the recent advances in language models, consider it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited for small, domain-specific medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely Slake, OVQA, and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.
