Video Moment Retrieval via Natural Language Queries

Abstract

In this paper, we propose a novel method for video moment retrieval (VMR) that achieves state-of-the-art (SOTA) performance on R@1 metrics and surpasses the SOTA on the high-IoU metric (R@1, IoU=0.7). First, we propose to use a multi-head self-attention mechanism, and further a cross-attention scheme, to capture video/query interaction and long-range query dependencies from the video context. The attention-based methods can develop frame-to-query and query-to-frame interactions at arbitrary positions, and the multi-head setting ensures a sufficient understanding of complicated dependencies. Our model has a simple architecture, which enables faster training and inference while maintaining [...]. Second, we propose to use a multi-task training objective consisting of a moment segmentation task, a start/end distribution prediction task, and a start/end location regression task. We have verified that start/end predictions are noisy due to annotator disagreement, and that joint training with the moment segmentation task provides richer information, since frames inside the target clip are also used as positive training examples. Third, we propose to use an early fusion approach, which achieves better performance at the cost of inference time. However, inference time is not a problem for our model, since its simple architecture enables efficient training and inference.
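
To make the multi-task objective concrete, the following is a minimal sketch of how training targets for the three tasks might be derived from a single annotated moment. The abstract does not specify the target encodings, so the one-hot start/end distributions, the normalization of boundaries to [0, 1], and the names `build_targets`, `num_frames`, `start`, and `end` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def build_targets(num_frames: int, start: int, end: int) -> Dict[str, List[float]]:
    """Build illustrative targets for one moment with inclusive frame indices.

    Hypothetical helper: the encodings below are assumptions, not taken
    from the paper.
    """
    if not 0 <= start <= end < num_frames:
        raise ValueError("expected 0 <= start <= end < num_frames")

    # Moment segmentation: every frame inside the clip is a positive example.
    segmentation = [1.0 if start <= i <= end else 0.0 for i in range(num_frames)]

    # Start/end distribution prediction: assumed one-hot over frame positions.
    start_dist = [1.0 if i == start else 0.0 for i in range(num_frames)]
    end_dist = [1.0 if i == end else 0.0 for i in range(num_frames)]

    # Start/end location regression: assumed boundaries normalized to [0, 1].
    denom = max(num_frames - 1, 1)
    regression = [start / denom, end / denom]

    return {
        "segmentation": segmentation,
        "start_dist": start_dist,
        "end_dist": end_dist,
        "regression": regression,
    }


targets = build_targets(num_frames=8, start=2, end=5)
```

In this toy case the segmentation target marks four frames as positive, while the start and end distributions each supervise a single frame. This is the sense in which the segmentation task supplies additional positive examples when the boundary labels themselves are noisy.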
