Abstract
Multimodal language models attempt to incorporate non-linguistic features for the language modeling task. In this work, we extend a standard recurrent neural network (RNN) language model with features derived from videos. We train our models on data that is two orders of magnitude larger than the datasets used in prior work. We perform a thorough exploration of model architectures for combining visual and text features. Our experiments on two corpora (YouCookII and 20bn-something-something-v2) show that the best-performing architecture consists of middle fusion of visual and text features, yielding over 25% relative improvement in perplexity. We report an analysis that provides insight into why our multimodal language model improves upon a standard RNN language model.
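
For readers unfamiliar with the metric, the following is a minimal sketch of how a relative improvement in perplexity is conventionally computed. It assumes the standard definition of perplexity as the exponential of the mean negative log-likelihood per token. The function names and the token probabilities are hypothetical and chosen only for illustration; they are not results or code from this work.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_improvement(baseline_ppl, model_ppl):
    """Fractional reduction in perplexity relative to the baseline."""
    return (baseline_ppl - model_ppl) / baseline_ppl


# Hypothetical per-token probabilities (illustrative only, not from the paper).
text_only = perplexity([math.log(0.05)] * 4)   # baseline RNN language model
multimodal = perplexity([math.log(0.08)] * 4)  # model with visual features

print(f"baseline perplexity:   {text_only:.2f}")
print(f"multimodal perplexity: {multimodal:.2f}")
print(f"relative improvement:  {relative_improvement(text_only, multimodal):.1%}")
```

Lower perplexity is better, so a positive value from `relative_improvement` means the second model assigns higher probability to the held-out text than the baseline does.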