Dual Recurrent Attention Units for Visual Question Answering

Abstract

We propose an architecture for VQA which utilizes recurrent layers to generate visual and textual attention. The memory characteristic of the proposed recurrent attention units offers a rich joint embedding of visual and textual features and enables the model to reason about relations between several parts of the image and question. Our single model outperforms the first-place winner on the VQA 1.0 dataset and performs within margin of the current state-of-the-art ensemble model. We also experiment with replacing attention mechanisms in other state-of-the-art models with our implementation and show increased accuracy. In both cases, our recurrent attention mechanism improves performance in tasks requiring sequential or relational reasoning on the VQA dataset.
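To make the idea of recurrently generated attention concrete, the following is a minimal sketch and not the architecture evaluated in the paper. It assumes a plain tanh recurrent cell as a stand-in for the recurrent layers, a list of image-region feature vectors, and a single question vector; the names `recurrent_attention`, `W_x`, `W_h`, and `w_out` are hypothetical and the weights are random. The point it illustrates is that each region's attention score depends on a hidden state that has already seen the preceding regions, which is what lets the attention take relations between parts of the input into account.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recurrent_attention(regions, question, W_x, W_h, w_out):
    """Score each region with a recurrent cell, then pool regions by attention.

    Assumption: a simple tanh RNN stands in for the recurrent layers.
    Returns (attention weights per region, attended feature vector).
    """
    hidden = [0.0] * len(W_h)
    scores = []
    for region in regions:
        # Fuse visual and textual features by concatenation (an assumption).
        fused = region + question
        pre = [a + b for a, b in zip(matvec(W_x, fused), matvec(W_h, hidden))]
        hidden = [math.tanh(p) for p in pre]  # memory carried across regions
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [sum(w * r[j] for w, r in zip(weights, regions)) for j in range(dim)]
    return weights, attended


# Toy sizes and random parameters, purely for illustration.
random.seed(0)
N_REGIONS, D_V, D_Q, D_H = 6, 8, 5, 4
regions = [[random.gauss(0, 1) for _ in range(D_V)] for _ in range(N_REGIONS)]
question = [random.gauss(0, 1) for _ in range(D_Q)]
W_x = [[random.gauss(0, 0.1) for _ in range(D_V + D_Q)] for _ in range(D_H)]
W_h = [[random.gauss(0, 0.1) for _ in range(D_H)] for _ in range(D_H)]
w_out = [random.gauss(0, 1) for _ in range(D_H)]

weights, attended = recurrent_attention(regions, question, W_x, W_h, w_out)
```

In this sketch `weights` holds one attention value per region and sums to one, and `attended` is the attention-weighted average of the region features. A textual counterpart would run the same loop over word embeddings instead of image regions.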
