Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Abstract

A key to visual question answering (VQA) lies in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities helps boost the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, which leads to correct answer prediction.
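
To make the symmetric, bi-directional attention described above concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the dot-product affinity, and the averaging fusion step are all hypothetical choices made for brevity, since the abstract does not specify them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Combine vectors using the given attention weights."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def co_attention_layer(words, regions):
    """One symmetric co-attention step (hypothetical sketch).

    words:   list of N word feature vectors
    regions: list of T image-region feature vectors (same dimension)
    """
    # Assumption: a plain dot product scores every word-region pair.
    affinity = [[dot(w, r) for r in regions] for w in words]

    # Each question word attends over all image regions (row-wise softmax).
    word_to_region = [softmax(row) for row in affinity]
    # Each image region attends over all question words (column-wise softmax).
    region_to_word = [
        softmax([affinity[i][j] for i in range(len(words))])
        for j in range(len(regions))
    ]

    attended_regions = [weighted_sum(a, regions) for a in word_to_region]
    attended_words = [weighted_sum(a, words) for a in region_to_word]

    # Assumption: fuse by averaging each feature with what it attended to.
    new_words = [[(x + y) / 2 for x, y in zip(w, a)]
                 for w, a in zip(words, attended_regions)]
    new_regions = [[(x + y) / 2 for x, y in zip(r, a)]
                   for r, a in zip(regions, attended_words)]
    return new_words, new_regions, word_to_region, region_to_word

def stacked_co_attention(words, regions, num_layers=2):
    """Stack layers so the two modalities interact over multiple steps."""
    word_to_region = region_to_word = None
    for _ in range(num_layers):
        words, regions, word_to_region, region_to_word = co_attention_layer(
            words, regions)
    return words, regions, word_to_region, region_to_word
```

Both attention maps are derived from the same affinity matrix, which is what makes the layer symmetric: neither modality is treated as the sole query. Because the layer returns updated word and region features of the same shape as its inputs, it can be applied repeatedly, as in `stacked_co_attention`.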
