Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention

Abstract

This paper studies the problem of temporal moment localization in a long untrimmed video using natural language as the query. Given an untrimmed video and a query sentence, the goal is to determine the start and end of the visual moment in the video that corresponds to the query. While previous works have tackled this task with a propose-and-rank approach, we introduce a more efficient, end-to-end trainable, proposal-free approach that relies on three key components: a dynamic filter to transfer language information to the visual domain, a new loss function to guide our model to attend to the most relevant parts of the video, and soft labels to model annotation uncertainty. We evaluate our method on two benchmark datasets, Charades-STA and ActivityNet-Captions. Experimental results show that our approach outperforms state-of-the-art methods on both datasets.
