Component Analysis for Visual Question Answering Architectures

Abstract

Recent research advances in Computer Vision and Natural Language Processing have introduced novel tasks that are paving the way for solving AI-complete problems. One of those tasks is Visual Question Answering (VQA). A VQA system must take an image and a free-form, open-ended natural language question about the image, and produce a natural language answer as output. This task has drawn great attention from the scientific community, which has generated a plethora of approaches that aim to improve VQA predictive accuracy. Most of them comprise three major components: (i) independent representation learning of images and questions; (ii) feature fusion, so the model can use information from both sources to answer visual questions; and (iii) the generation of the correct answer in natural language. With so many approaches recently introduced, the real contribution of each component to the ultimate performance of the model has become unclear. The main goal of this paper is to provide a comprehensive analysis of the impact of each component in VQA models. Our extensive set of experiments covers both visual and textual elements, as well as the combination of these representations in the form of fusion and attention mechanisms. Our major contribution is to identify the core components for training VQA models so as to maximize their predictive performance.
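
As a rough illustration of the three-component structure described above, the following toy sketch wires together (i) separate question and image encoders, (ii) a fusion step, and (iii) an answer predictor. It is not the architecture evaluated in the paper. Every name, dimension, and weight here is a hypothetical placeholder: the weights are random, fusion is assumed to be an element-wise product, and answer generation is simplified to classification over a small fixed answer set.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

DIM = 8                                              # hypothetical shared embedding size
VOCAB = ["what", "color", "is", "the", "cat", "dog"]  # toy question vocabulary
ANSWERS = ["black", "white", "yes", "no"]            # toy answer set (assumption)


def rand_matrix(rows, cols):
    """Random weight matrix standing in for learned parameters."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


WORD_EMB = {w: [random.uniform(-1.0, 1.0) for _ in range(DIM)] for w in VOCAB}
IMG_PROJ = rand_matrix(DIM, 16)          # maps a 16-d image feature into DIM
CLASSIFIER = rand_matrix(len(ANSWERS), DIM)


def encode_question(question):
    """(i) Text representation: mean of word embeddings, unknown words skipped."""
    vecs = [WORD_EMB[w] for w in question.lower().split() if w in WORD_EMB]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def encode_image(features):
    """(i) Image representation: linear projection followed by tanh."""
    return [math.tanh(x) for x in matvec(IMG_PROJ, features)]


def fuse(q_vec, v_vec):
    """(ii) Fusion: element-wise product (one simple choice among many)."""
    return [a * b for a, b in zip(q_vec, v_vec)]


def answer_distribution(fused):
    """(iii) Answer prediction: softmax over the fixed answer set."""
    logits = matvec(CLASSIFIER, fused)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict(question, image_features):
    """Run the full toy pipeline and return (best answer, probabilities)."""
    fused = fuse(encode_question(question), encode_image(image_features))
    probs = answer_distribution(fused)
    best = max(range(len(ANSWERS)), key=probs.__getitem__)
    return ANSWERS[best], probs
```

Calling `predict("what color is the cat", [0.1 * i for i in range(16)])` returns one entry of `ANSWERS` together with a probability for each candidate. Because the weights are random, the prediction itself is meaningless; the point is only that each of the three stages is a separate, swappable function, which is what makes a per-component analysis possible.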
