Interpretable Visual Question Answering by Reasoning on Dependency Trees

Abstract

Collaborative reasoning for understanding each image-question pair is verycritical but underexplored for an interpretable visual question answeringsystem. Although very recent works also attempted to use explicit compositionalprocesses to assemble multiple subtasks embedded in the questions, their modelsheavily rely on annotations or handcrafted rules to obtain valid reasoningprocesses, leading to either heavy workloads or poor performance on compositionreasoning. In this paper, to better align image and language domains in diverseand unrestricted cases, we propose a novel neural network model that performsglobal reasoning on a dependency tree parsed from the question, and we thusphrase our model as parse-tree-guided reasoning network (PTGRN). This networkconsists of three collaborative modules: i) an attention module to exploit thelocal visual evidence for each word parsed from the question, ii) a gatedresidual composition module to compose the previously mined evidence, and iii)a parse-tree-guided propagation module to pass the mined evidence along theparse tree. Our PTGRN is thus capable of building an interpretable VQA systemthat gradually derives the image cues following a question-driven parse-treereasoning route. Experiments on relational datasets demonstrate the superiorityof our PTGRN over current state-of-the-art VQA methods, and the visualizationresults highlight the explainable capability of our reasoning system.

Quick Read (beta)

loading the full paper ...