Adaptively Aligned Image Captioning via Adaptive Attention Time

Abstract

Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word, assuming a one-to-one mapping between source image regions and target caption words, which is never possible. In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), which can adaptively align source to target for image captioning. AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, image regions and caption words can be aligned adaptively in the decoding process: an image region can be mapped to an arbitrary number of caption words, while a caption word can also attend to an arbitrary number of image regions. AAT is deterministic and differentiable, and does not introduce any noise to the parameter gradients. AAT is also generic and can be employed by any sequence-to-sequence learning task. In this paper, we empirically show that AAT improves over state-of-the-art methods on the task of image captioning.
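
To make the idea of a learned, variable number of attention steps per caption word concrete, the following is a minimal NumPy sketch. It is not the formulation used in the paper: the halting rule is borrowed from the general adaptive-computation-time idea, and the function name `adaptive_attention_step`, the `tanh` state update, and the parameters `w_halt`, `b_halt`, `max_steps`, and `eps` are all assumptions made for illustration. It only shows how a stopping confidence can decide, deterministically, how many times the decoder attends to the image regions before emitting a word.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def adaptive_attention_step(regions, query, w_halt, b_halt=0.0,
                            max_steps=4, eps=0.01):
    """One decoding step with a variable number of attention steps.

    regions: (k, d) image region features.
    query:   (d,) decoder state for the current caption word.
    Returns the blended state (d,) and the number of attention steps taken.
    All update rules here are illustrative assumptions.
    """
    state = query
    remaining = 1.0                      # weight not yet assigned to any step
    output = np.zeros_like(query, dtype=float)
    steps = 0
    for n in range(max_steps):
        weights = softmax(regions @ state)   # attention over image regions
        context = weights @ regions          # attended feature vector, (d,)
        state = np.tanh(state + context)     # hypothetical state update
        steps += 1
        p = sigmoid(w_halt @ state + b_halt)  # confidence that we can stop
        is_last = (n == max_steps - 1)
        share = remaining if is_last else remaining * p
        output += share * state              # blend this step into the output
        remaining -= share
        if remaining < eps:                  # confident enough: stop attending
            break
    output += remaining * state              # give any leftover weight to the last state
    return output, steps
```

Because every step contributes through a smooth weight rather than a sampled decision, the blended output stays differentiable with respect to the halting parameters, which is the property the abstract emphasizes. With a strongly positive `b_halt` the loop stops after a single attention step, and with a strongly negative one it runs to `max_steps`.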
