Abstract
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and to fuse these multi-modal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, here we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a co-attention mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that the single MFB with co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code is available at https://github.com/yuzcccc/mfb.
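To illustrate why a factorized bilinear form is cheaper than full bilinear pooling, the following is a minimal sketch and not the released implementation. The abstract does not spell out the factorization, so the sketch assumes the common low-rank form in which each output is x^T W_i y with W_i = U_i V_i^T of rank at most k. The function names (`factorized_bilinear_pool`, `full_bilinear`), the projection matrices `U` and `V`, and the sizes `m`, `n`, `k`, `o` are illustrative assumptions. Normalization, attention, and learning are omitted.

```python
import random


def matvec_t(M, v):
    """Return M^T v for a row-major matrix M with len(v) rows."""
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]


def factorized_bilinear_pool(x, y, U, V, k):
    """Fuse x and y without materializing any bilinear weight matrix.

    U is m x (k*o) and V is n x (k*o) (assumed shapes); the result has length o.
    """
    px = matvec_t(U, x)  # project the first modality to k*o dimensions
    py = matvec_t(V, y)  # project the second modality to k*o dimensions
    joint = [a * b for a, b in zip(px, py)]  # element-wise product
    # Sum over non-overlapping windows of size k: one output per window.
    return [sum(joint[s:s + k]) for s in range(0, len(joint), k)]


def full_bilinear(x, y, U, V, k, i):
    """Reference value x^T W_i y with W_i = U_i V_i^T built entry by entry."""
    cols = range(i * k, (i + 1) * k)
    total = 0.0
    for a in range(len(x)):
        for b in range(len(y)):
            w_ab = sum(U[a][c] * V[b][c] for c in cols)  # entry (a, b) of W_i
            total += x[a] * w_ab * y[b]
    return total


# Toy sizes chosen only for the demonstration.
rng = random.Random(0)
m, n, k, o = 6, 5, 3, 4
x = [rng.gauss(0, 1) for _ in range(m)]
y = [rng.gauss(0, 1) for _ in range(n)]
U = [[rng.gauss(0, 1) for _ in range(k * o)] for _ in range(m)]
V = [[rng.gauss(0, 1) for _ in range(k * o)] for _ in range(n)]

z = factorized_bilinear_pool(x, y, U, V, k)
reference = [full_bilinear(x, y, U, V, k, i) for i in range(o)]
```

Under these assumptions, `z` and `reference` agree up to floating-point error, while the factorized path stores (m + n) * k * o parameters rather than the m * n * o needed for unconstrained bilinear weights.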