R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering

Abstract

Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning, as it requires understanding both visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn a joint feature embedding via multimodal fusion or attention mechanisms. Some recent studies use external, VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes may be unrelated to the VQA task and have limited semantic capacity. To better utilize the semantic knowledge in images, we propose a novel framework that learns visual relation facts for VQA. Specifically, we build a Relation-VQA (R-VQA) dataset based on the Visual Genome dataset via a semantic similarity module, in which each data instance consists of an image, a corresponding question, a correct answer, and a supporting relation fact. A well-defined relation detector is then adopted to predict visual question-related relation facts. We further propose a multi-step attention model that applies visual attention and semantic attention sequentially to extract related visual knowledge and semantic knowledge. We conduct comprehensive experiments on two benchmark datasets, demonstrating that our model achieves state-of-the-art performance and verifying the benefit of considering visual relation facts.
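To make the data layout and the sequential attention idea concrete, the following is a minimal sketch in plain Python. It is not the architecture of the paper: the record field names, the (subject, relation, object) form of a relation fact, the dot-product scoring, and the additive query update are all assumptions chosen for illustration, and the helper names `RVQASample`, `attend`, and `multi_step_attention` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class RVQASample:
    """One dataset record as described in the abstract (field names are hypothetical)."""
    image_id: str
    question: str
    answer: str
    relation_fact: Tuple[str, str, str]  # assumed form: (subject, relation, object)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, items: List[Vector]) -> Tuple[Vector, List[float]]:
    """Dot-product attention (an assumed scoring function, not the paper's)."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the items.
    context = [
        sum(w * item[i] for w, item in zip(weights, items))
        for i in range(len(query))
    ]
    return context, weights


def multi_step_attention(question: Vector, regions: List[Vector], facts: List[Vector]):
    """Apply visual attention first, then semantic attention over relation facts."""
    visual_ctx, visual_w = attend(question, regions)
    # Assumption: the query is refined by adding the attended visual context.
    refined = [q + v for q, v in zip(question, visual_ctx)]
    semantic_ctx, semantic_w = attend(refined, facts)
    return visual_ctx, semantic_ctx, visual_w, semantic_w
```

Here `question`, `regions`, and `facts` stand for embeddings of equal dimension that have already been computed. A call such as `multi_step_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` returns the two context vectors together with the attention weights of each step, so that the region and the fact most aligned with the question receive the largest weights.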
