Video Captioning with Boundary-aware Hierarchical Language Decoding and Joint Video Prediction

Abstract

The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video feature encoding, the language decoder is still largely based on the prevailing RNN decoder such as LSTM, which tends to prefer frequent words that align with the video. In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU-based language decoder, working as a global (caption-level) language model, and a low-level GRU-based language decoder, working as a local (phrase-level) language model. Most importantly, we introduce a binary gate into the low-level GRU language decoder to detect the language boundaries. Together with other advanced components including joint video prediction, shared soft attention, and boundary-aware video encoding, our integrated video captioning framework can discover hierarchical language information and distinguish the subject and the object in a sentence, which are often confused during language generation. Extensive experiments on two widely used video captioning datasets, MSR-Video-to-Text (MSR-VTT) \cite{xu2016msr} and YouTube-to-Text (MSVD) \cite{chen2011collecting}, show that our method is highly competitive with state-of-the-art methods.
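
To make the two-level decoder idea concrete, the sketch below shows one way a low-level GRU with a binary boundary gate could drive a high-level GRU. It is an illustration only, not the authors' implementation: the abstract gives no equations, so the hard 0.5 threshold, the reset of the low-level state to zero after a boundary, the random untrained weights, and the names `GRUCell`, `hierarchical_decode`, and `gate_w` are all assumptions. Attention, video features, and word prediction are omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell without biases (hypothetical helper, untrained weights)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One input and one recurrent matrix per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def hierarchical_decode(embeddings, low, high, gate_w):
    """Run the low-level GRU over word embeddings; at each detected boundary,
    pass the phrase summary to the high-level GRU and reset the low-level state."""
    h_low = [0.0] * low.hidden_size
    h_high = [0.0] * high.hidden_size
    boundaries = []
    for x in embeddings:
        h_low = low.step(x, h_low)
        # Binary gate (assumed form): hard threshold on a sigmoid score.
        score = sigmoid(sum(w * h for w, h in zip(gate_w, h_low)))
        b = 1 if score > 0.5 else 0
        boundaries.append(b)
        if b == 1:
            h_high = high.step(h_low, h_high)   # caption-level update
            h_low = [0.0] * low.hidden_size     # start a new phrase
    return boundaries, h_high


rng = random.Random(0)
low = GRUCell(input_size=4, hidden_size=3, rng=rng)
high = GRUCell(input_size=3, hidden_size=3, rng=rng)
gate_w = [rng.uniform(-1.0, 1.0) for _ in range(3)]
words = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
boundaries, caption_state = hierarchical_decode(words, low, high, gate_w)
```

In this sketch the high-level state changes only at steps where the gate fires, which is what lets it act as a caption-level model over phrase-level summaries. A trained model would also need a differentiable or estimator-based treatment of the hard threshold, which is not shown here.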
