Localizing Moments in Video with Temporal Language

Abstract

Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in text. We propose a new model that explicitly reasons about different temporal segments in a video, and show that temporal context is important for localizing phrases which include temporal language. To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language), which allows for controlled studies on temporal language, and a human language dataset which consists of temporal sentences annotated by humans (TEMPO - Human Language).
