Transfer learning from language models to image caption generators: Better models may not transfer better

Abstract

When designing a neural caption generator, a convolutional neural network can be used to extract image features. Is it possible to also use a neural language model to extract sentence prefix features? We answer this question by trying different ways to transfer the recurrent neural network and embedding layer from a neural language model to an image caption generator. We find that image caption generators with transferred parameters perform better than those trained from scratch, even when the language model is simply pre-trained on the text of the same captions dataset that the caption generator will later be trained on. We also find that the best language models (in terms of perplexity) do not result in the best caption generators after transfer learning.
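As a rough illustration of the kind of parameter transfer the abstract describes, the sketch below copies the embedding and recurrent layers of a toy language model into a toy caption generator, leaving the remaining layers at their fresh initialization. The abstract does not specify the architecture, so the dictionary layout, the layer names ("embedding", "rnn", "output", "image"), the layer shapes, and all function names are assumptions made only for this example.

```python
import copy
import random


def init_layer(rows, cols, rng):
    """Return a rows x cols weight matrix filled with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def new_language_model(vocab_size, embed_size, hidden_size, rng):
    # Hypothetical layout: embedding -> recurrent layer -> output layer.
    return {
        "embedding": init_layer(vocab_size, embed_size, rng),
        "rnn": init_layer(embed_size + hidden_size, hidden_size, rng),
        "output": init_layer(hidden_size, vocab_size, rng),
    }


def new_caption_generator(vocab_size, embed_size, hidden_size, image_size, rng):
    # Hypothetical layout: the same prefix-encoding layers as the language
    # model, plus an image projection whose result is concatenated with the
    # hidden state before the output layer (an assumption, not the paper's).
    return {
        "embedding": init_layer(vocab_size, embed_size, rng),
        "rnn": init_layer(embed_size + hidden_size, hidden_size, rng),
        "image": init_layer(image_size, hidden_size, rng),
        "output": init_layer(2 * hidden_size, vocab_size, rng),
    }


def transfer(language_model, caption_generator, layers=("embedding", "rnn")):
    """Return a copy of caption_generator with the named layers taken from
    language_model. Layers not named keep their original values."""
    result = copy.deepcopy(caption_generator)
    for name in layers:
        src, dst = language_model[name], result[name]
        if len(src) != len(dst) or len(src[0]) != len(dst[0]):
            raise ValueError(f"shape mismatch for layer {name!r}")
        result[name] = copy.deepcopy(src)
    return result


rng = random.Random(0)
lm = new_language_model(vocab_size=5, embed_size=3, hidden_size=4, rng=rng)
cg = new_caption_generator(vocab_size=5, embed_size=3, hidden_size=4,
                           image_size=6, rng=rng)
cg_transferred = transfer(lm, cg)
```

Passing a different `layers` tuple, for example `("embedding",)`, gives one way to mimic "different ways to transfer" in this toy setting; whether transferred layers are then frozen or fine-tuned is a separate choice that this sketch does not model.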
