MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

Abstract

This research addresses natural language moment retrieval in long, untrimmed video streams. The problem is not trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal relations as a structured graph and devise an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner. We evaluate the proposed approach on two challenging public benchmarks, Charades-STA and DiDeMo, where MAN outperforms the state of the art by a large margin.
