MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

Abstract

Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and a significant advancement in healthcare. It helps medical experts swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the interpretability and transparency of existing MedVQA solutions are often limited, making their decision-making processes difficult to understand. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build new benchmark MedVQA datasets, R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales, generated by multimodal large language models and human annotations, for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework that fine-tunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.
