A sequential guiding network with attention for image captioning

Abstract

The recent advances of deep learning in both computer vision (CV)and naturallanguage processing (NLP) provide us a new way of un-derstanding semantics, bywhich we can deal with more challeng-ing tasks such as automatic descriptiongeneration from natural im-ages. In this challenge, the encoder-decoderframework has achievedpromising performance when a convolutional neural network(CNN)is used as image encoder and a recurrent neural network (RNN) asdecoder.In this paper, we introduce a sequential guiding networkthat guides the decoderduring word generation. The new model is anextension of the encoder-decoderframework with attention that hasan additional guiding long short-term memory(LSTM) and can betrained in an end-to-end manner by using image/descriptionspairs.We validate our approach by conducting extensive experiments onabenchmark dataset, i.e., MS COCO Captions. The proposed model achievessignificant improvement comparing to the other state-of-the-art deep learningmodels.

Quick Read (beta)

loading the full paper ...