OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement

Abstract

Traditional video captioning asks for a holistic description of the video, so detailed descriptions of specific objects may not be available. Without associating the objects' moving trajectories, these image-based, data-driven methods cannot understand activities from the spatio-temporal transitions in the inter-object visual features. Moreover, training on ambiguous clip-sentence pairs hinders learning of the multi-modal functional mappings because of their one-to-many nature. In this paper, we propose a novel task, named object-oriented video captioning, to understand videos at the object level. We introduce the video-based object-oriented video captioning network (OVC-Net), which uses a temporal graph and detail enhancement to effectively analyze activities over time and to stably capture vision-language connections under small-sample conditions. The temporal graph provides a useful supplement to previous image-based approaches, making it possible to reason about activities from the temporal evolution of visual features and the dynamic movement of spatial locations. The detail enhancement helps to capture the discriminative features among different objects, with which the subsequent captioning module can yield more informative and precise descriptions. We then construct a new dataset that provides consistent object-sentence pairs to facilitate effective cross-modal learning. To demonstrate the effectiveness of OVC-Net, we conduct experiments on the new dataset and compare it with state-of-the-art video captioning methods. The experimental results show that OVC-Net can precisely describe concurrent objects and achieves state-of-the-art performance.
