Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation

Abstract

Image Captioning, or the automatic generation of descriptions for images, isone of the core problems in Computer Vision and has seen considerable progressusing Deep Learning Techniques. We propose to use Inception-ResNetConvolutional Neural Network as encoder to extract features from images,Hierarchical Context based Word Embeddings for word representations and a DeepStacked Long Short Term Memory network as decoder, in addition to using ImageData Augmentation to avoid over-fitting. For data Augmentation, we useHorizontal and Vertical Flipping in addition to Perspective Transformations onthe images. We evaluate our proposed methods with two image captioningframeworks- Encoder-Decoder and Soft Attention. Evaluation on widely usedmetrics have shown that our approach leads to considerable improvement in modelperformance.

Quick Read (beta)

loading the full paper ...