Normalized and Geometry-Aware Self-Attention Network for Image Captioning

Abstract

Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA in two respects to boost image captioning performance. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. Whereas normalization was previously applied only outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for the major limitation of the Transformer, namely that it fails to model the geometric structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently consider the relative geometric relations between the objects in the image. To construct our image captioning model, we combine the two modules and apply them to the vanilla self-attention network. We extensively evaluate our proposals on the MS-COCO image captioning dataset and achieve superior results compared with state-of-the-art approaches. Further experiments on three challenging tasks, i.e., video captioning, machine translation, and visual question answering, show the generality of our methods.
