Neural Language Modeling with Visual Features

  • 2019-03-07 14:20:01
  • Antonios Anastasopoulos, Shankar Kumar, Hank Liao

Abstract

Multimodal language models attempt to incorporate non-linguistic features for the language modeling task. In this work, we extend a standard recurrent neural network (RNN) language model with features derived from videos. We train our models on data that is two orders-of-magnitude bigger than datasets used in prior work. We perform a thorough exploration of model architectures for combining visual and text features. Our experiments on two corpora (YouCookII and 20bn-something-something-v2) show that the best performing architecture consists of middle fusion of visual and text features, yielding over 25% relative improvement in perplexity. We report analysis that provides insights into why our multimodal language model improves upon a standard RNN language model.
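
As a reading aid for the "relative improvement in perplexity" figure above, the following is a minimal sketch of how such a number is typically computed. The function names `perplexity` and `relative_improvement` are hypothetical, and the probabilities are illustrative values chosen for the example, not results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def relative_improvement(baseline_ppl, model_ppl):
    """Fractional reduction in perplexity relative to the baseline."""
    return (baseline_ppl - model_ppl) / baseline_ppl


# Illustrative numbers only (assumed, not taken from the paper):
# a text-only model assigning p = 0.01 to every token, and a
# multimodal model assigning p = 0.0125 to every token.
baseline = perplexity([math.log(0.01)] * 4)
fused = perplexity([math.log(0.0125)] * 4)
print(f"baseline={baseline:.1f} fused={fused:.1f} "
      f"relative improvement={relative_improvement(baseline, fused):.2%}")
```

Lower perplexity is better, so a positive relative improvement means the multimodal model assigns higher probability to the reference text than the baseline does.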

 
