Where to put the Image in an Image Caption Generator

Abstract

When a recurrent neural network language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN -- conditioning the language model by 'injecting' image features -- or in a layer following the RNN -- conditioning the language model by 'merging' image features. While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper we empirically show that it is not especially detrimental to performance whether one architecture is used or another. The merge architecture does have practical advantages, as conditioning by merging allows the RNN's hidden state vector to shrink in size by up to four times. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN as that yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.
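The difference between the two conditioning strategies can be illustrated with a minimal sketch of one decoding step. This is not the paper's implementation. It assumes a plain tanh RNN cell, random untrained weights, and an inject variant that concatenates the image vector with the word vector at every time step. The function and parameter names (`inject_step`, `merge_step`, `make_params`, `W_in`, `W_rec`, `W_out`) are hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rnn_step(W_in, W_rec, x, h):
    """Assumed cell: h' = tanh(W_in x + W_rec h)."""
    a = matvec(W_in, x)
    b = matvec(W_rec, h)
    return [math.tanh(p + q) for p, q in zip(a, b)]


def make_params(kind, word_dim, image_dim, hidden_dim, vocab_size, rng):
    """Weight shapes differ only in where the image vector enters."""
    rnn_in = word_dim + image_dim if kind == "inject" else word_dim
    out_in = hidden_dim if kind == "inject" else hidden_dim + image_dim
    return {
        "W_in": rand_matrix(hidden_dim, rnn_in, rng),
        "W_rec": rand_matrix(hidden_dim, hidden_dim, rng),
        "W_out": rand_matrix(vocab_size, out_in, rng),
    }


def inject_step(params, word_vec, image_vec, h):
    """Inject: the RNN sees the image, so h encodes both modalities."""
    h_new = rnn_step(params["W_in"], params["W_rec"], word_vec + image_vec, h)
    probs = softmax(matvec(params["W_out"], h_new))
    return probs, h_new


def merge_step(params, word_vec, image_vec, h):
    """Merge: the RNN sees only words; the image joins after the RNN."""
    h_new = rnn_step(params["W_in"], params["W_rec"], word_vec, h)
    probs = softmax(matvec(params["W_out"], h_new + image_vec))
    return probs, h_new
```

In the merge variant the hidden state only has to encode the caption prefix, which is consistent with the abstract's observation that the RNN's hidden state vector can be smaller under merging. The sketch itself makes no claim about performance.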
