End-to-End Video Captioning with Multitask Reinforcement Learning

Abstract

Although end-to-end (E2E) learning has led to impressive progress on a variety of visual understanding tasks, it is often impeded by hardware constraints (e.g., GPU memory) and is prone to overfitting. When it comes to video captioning, one of the most challenging benchmark tasks in computer vision, those limitations of E2E learning are especially amplified by the fact that both the input videos and output captions are lengthy sequences. Indeed, state-of-the-art methods for video captioning process video frames with convolutional neural networks and generate captions by unrolling recurrent neural networks. If we connect them in an E2E manner, the resulting model is both memory-consuming and data-hungry, making it extremely hard to train. In this paper, we propose a multitask reinforcement learning approach to training an E2E video captioning model. The main idea is to mine and construct as many effective tasks (e.g., attributes, rewards, and the captions) as possible from the human-captioned videos such that they can jointly regulate the search space of the E2E neural network, from which an E2E video captioning model can be found and generalized to the testing phase. To the best of our knowledge, this is the first video captioning model that is trained end-to-end from the raw video input to the caption output. Experimental results show that such a model outperforms existing ones by a large margin on two benchmark video captioning datasets.
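
As a rough illustration of how several tasks mined from the same captioned video can jointly shape one training objective, the sketch below adds up three loss terms: a caption likelihood term, an attribute prediction term, and a reward-driven term. It is not the paper's implementation. The function names, the weights, the binary cross-entropy form of the attribute loss, and the REINFORCE-style surrogate with a baseline are all assumptions made for illustration, and probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def caption_nll(token_probs):
    """Negative log-likelihood of the ground-truth caption tokens."""
    return -sum(math.log(p) for p in token_probs)


def attribute_bce(pred_probs, targets):
    """Binary cross-entropy summed over multi-label attribute predictions."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(pred_probs, targets)
    )


def reinforce_surrogate(sample_logprobs, reward, baseline):
    """REINFORCE-style surrogate loss for one sampled caption.

    Minimizing it raises the log-probability of samples whose reward
    exceeds the baseline and lowers it for samples that fall short.
    """
    return -(reward - baseline) * sum(sample_logprobs)


def multitask_loss(token_probs, attr_pred, attr_target,
                   sample_logprobs, reward, baseline,
                   weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (weights are hypothetical)."""
    w_cap, w_attr, w_rl = weights
    return (
        w_cap * caption_nll(token_probs)
        + w_attr * attribute_bce(attr_pred, attr_target)
        + w_rl * reinforce_surrogate(sample_logprobs, reward, baseline)
    )


if __name__ == "__main__":
    # Toy numbers only; they do not come from the paper.
    loss = multitask_loss(
        token_probs=[0.5, 0.5],
        attr_pred=[0.5],
        attr_target=[1],
        sample_logprobs=[-1.0, -2.0],
        reward=0.8,
        baseline=0.5,
    )
    print(f"combined loss: {loss:.4f}")
```

In a real system each term would be computed from network outputs and backpropagated through the shared CNN and RNN, so that every task constrains the same set of parameters.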
