Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Abstract

Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
