VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

Abstract

In this paper, we leverage the human perceiving process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities: (i) a vision modality that captures the global visual content of the entire scene and (ii) a language modality that extracts descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and both visual and non-visual elements (e.g., relations, activities). Furthermore, we train the proposed VLCap under a contrastive learning VL loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.
