A Novel Attention-based Aggregation Function to Combine Vision and Language

Abstract

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements -- like regions and words -- proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
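
To make the idea of a cross-modal reduction concrete, the following is a minimal sketch under simplifying assumptions, not the formulation proposed in the paper. It scores each element of one modality by its mean scaled dot product with the elements of the other modality, turns the scores into softmax weights, and pools each set into a single vector. The function names (`cross_scores`, `attentive_reduce`, `match_score`) are hypothetical, and the learnable projections that a trained model would use are omitted.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_scores(queries, keys):
    """One score per query element: its mean scaled dot product
    with every element of the other modality (an assumed scoring rule)."""
    scale = math.sqrt(len(queries[0]))
    return [sum(dot(q, k) for k in keys) / (len(keys) * scale) for q in queries]


def attentive_reduce(elements, scores):
    """Collapse a set of vectors into one vector using softmax weights."""
    weights = softmax(scores)
    dim = len(elements[0])
    return [sum(w * e[d] for w, e in zip(weights, elements)) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def match_score(regions, words):
    """Reduce each modality conditioned on the other, then compare."""
    v = attentive_reduce(regions, cross_scores(regions, words))
    t = attentive_reduce(words, cross_scores(words, regions))
    return cosine(v, t)


# Toy inputs: two image-region vectors and one word vector.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0]]
score = match_score(regions, words)
```

For ranking, a similarity such as `score` would be used directly. For classification, the two pooled vectors would instead be fed to a classifier, which this sketch does not include.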
