Exploiting Temporal Relationships in Video Moment Localization with Natural Language

Abstract

We address the problem of video moment localization with natural language, i.e. localizing a video segment described by a natural language sentence. While most prior work focuses on grounding the query as a whole, temporal dependencies and reasoning between events within the text are not fully considered. In this paper, we propose a novel Temporal Compositional Modular Network (TCMN) where a tree attention network first automatically decomposes a sentence into three descriptions with respect to the main event, context event and temporal signal. Two modules are then utilized to measure the visual similarity and location similarity between each segment and the decomposed descriptions. Moreover, since the main event and context event may rely on different modalities (RGB or optical flow), we use late fusion to form an ensemble of four models, where each model is independently trained by one combination of the visual input. Experiments show that our model outperforms the state-of-the-art methods on the TEMPO dataset.
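
The late-fusion step can be illustrated with a minimal sketch. The abstract does not state how the four models' outputs are combined, so the code below assumes simple averaging of per-segment scores; the function name `late_fusion`, the modality labels, and the example scores are all hypothetical and are not taken from the paper.

```python
from itertools import product

# Assumed labels for the two visual inputs named in the abstract.
MODALITIES = ("rgb", "flow")

# One model per (main-event modality, context-event modality) pair: four in total.
combos = list(product(MODALITIES, repeat=2))


def late_fusion(scores_by_model):
    """Average per-segment scores across models (an assumed fusion rule).

    scores_by_model maps a model key to a list of scores, one per candidate
    segment. Returns (index of best segment, list of fused scores).
    """
    n_models = len(scores_by_model)
    n_segments = len(next(iter(scores_by_model.values())))
    fused = [
        sum(scores[i] for scores in scores_by_model.values()) / n_models
        for i in range(n_segments)
    ]
    best = max(range(n_segments), key=fused.__getitem__)
    return best, fused


# Made-up scores for three candidate segments, for illustration only.
example_scores = {
    ("rgb", "rgb"): [0.2, 0.7, 0.1],
    ("rgb", "flow"): [0.3, 0.5, 0.2],
    ("flow", "rgb"): [0.1, 0.6, 0.3],
    ("flow", "flow"): [0.4, 0.4, 0.2],
}

best_segment, fused_scores = late_fusion(example_scores)
```

Because each model is trained independently, fusion of this kind only touches the models' output scores, so any of the four models can be retrained or replaced without affecting the others.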
