Abstract
We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment described by a natural language query in an untrimmed video. The task poses great challenges for proper semantic alignment between the visual and linguistic domains. Most existing methods extract video and sentence features independently and leverage the sentence only in the multi-modal fusion stage, which does not make full use of the potential of language. In this paper, we present Language Guided Networks (LGN), a new framework that tightly integrates cross-modal features in multiple stages. In the first stage, feature extraction, we propose to capture discriminative visual features that can cover the complex semantics of the sentence query. Specifically, an early modulation unit is designed to modulate convolutional feature maps with a linguistic embedding. In the second stage, fusion, we adopt a multi-modal fusion module. Finally, to obtain a precise localizer, the sentence information is used to guide the prediction of temporal positions. Specifically, a late guidance module is developed to further bridge the vision and language domains via a channel attention mechanism. We evaluate the proposed model on two popular public datasets: Charades-STA and TACoS. The experimental results demonstrate the superior performance of our proposed modules on moment retrieval (improving by 5.8\% in terms of R1@IoU5 on Charades-STA and 5.2\% on TACoS). We include the code in the supplementary material and will make it publicly available.
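
The idea of letting a sentence embedding steer visual channels through channel attention can be illustrated with a minimal sketch. This is not the LGN implementation: the linear projection, the sigmoid gating, and the names `channel_gates` and `apply_channel_attention` are assumptions chosen for illustration, and the abstract does not specify the exact form of either the early modulation unit or the late guidance module.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def channel_gates(sentence, weight, bias):
    """Project a sentence embedding (length D) to one gate in (0, 1) per channel.

    weight is a C x D matrix and bias has length C (both hypothetical
    learned parameters; here they are plain Python lists).
    """
    return [
        sigmoid(sum(w * s for w, s in zip(row, sentence)) + b)
        for row, b in zip(weight, bias)
    ]


def apply_channel_attention(features, gates):
    """Rescale each channel of a C x T feature map by its language-derived gate."""
    return [[g * v for v in channel] for channel, g in zip(features, gates)]


if __name__ == "__main__":
    # Toy example: 2 channels, 3 temporal positions, 2-dim sentence embedding.
    sentence = [1.0, -1.0]
    weight = [[2.0, 0.0], [0.0, 2.0]]
    bias = [0.0, 0.0]
    features = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]

    gates = channel_gates(sentence, weight, bias)
    print(gates)  # first channel is emphasized, second is suppressed
    print(apply_channel_attention(features, gates))
```

In this sketch the query decides how strongly each visual channel contributes to the downstream localizer, which is the general intuition behind language-guided channel attention; a trained model would learn `weight` and `bias` jointly with the rest of the network.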