Vision-Dialog Navigation by Exploring Cross-modal Memory

Abstract

Vision-dialog navigation, posed as a new holy-grail task in the vision-language discipline, aims to learn an agent that can continually converse in natural language to ask for help and navigate according to human responses. Besides the common challenges of vision-language navigation, vision-dialog navigation also requires handling the language intentions of a series of questions given the temporal context of the dialog history, and co-reasoning over both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. Our CMN consists of two memory modules, the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and the dialog history by employing a multi-head attention mechanism. V-mem learns to associate the current visual views with the cross-modal memory of previous navigation actions. The cross-modal memory is generated via vision-to-language attention and language-to-vision attention. Benefiting from the collaborative learning of L-mem and V-mem, our CMN is able to explore the memory of the decision making behind historical navigation actions, which informs the current step. Experiments on the CVDN dataset show that our CMN outperforms the previous state-of-the-art model by a significant margin in both seen and unseen environments.
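
The data flow described above (L-mem attending over the dialog history, a cross-modal memory built from attention in both directions, and V-mem attending from the current views to that memory) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and not the authors' implementation: the feature size, the number of views and dialog turns, the absence of learned projections, the way the two attention directions are combined, and the dot-product action scoring are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """Scaled dot-product attention: query (n, d), keys/values (m, d) -> (n, d)."""
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ values

def multi_head_attend(query, memory, num_heads):
    """Split features into heads, attend per head, concatenate the results.
    Learned per-head projections are omitted to keep the sketch short."""
    q_heads = np.split(query, num_heads, axis=-1)
    m_heads = np.split(memory, num_heads, axis=-1)
    return np.concatenate(
        [attend(q, m, m) for q, m in zip(q_heads, m_heads)], axis=-1
    )

# Hypothetical toy features (all sizes are assumptions).
rng = np.random.default_rng(0)
d = 8
current_question = rng.normal(size=(1, d))  # current language interaction
dialog_history = rng.normal(size=(5, d))    # earlier dialog turns
past_views = rng.normal(size=(4, d))        # views from previous navigation steps
current_views = rng.normal(size=(36, d))    # candidate views at the current step

# L-mem: relate the current question to the dialog history.
language_memory = multi_head_attend(current_question, dialog_history, num_heads=2)

# Cross-modal memory: attention in both directions, stacked (assumed combination).
lang_to_vis = attend(language_memory, past_views, past_views)     # (1, d)
vis_to_lang = attend(past_views, dialog_history, dialog_history)  # (4, d)
cross_modal_memory = np.concatenate([lang_to_vis, vis_to_lang], axis=0)

# V-mem: associate the current views with the cross-modal memory.
visual_memory = attend(current_views, cross_modal_memory, cross_modal_memory)

# Assumed action scoring: match each memory-aware view against the language memory.
action_logits = (visual_memory @ language_memory.T).squeeze(-1)
action_probs = softmax(action_logits)
```

In this sketch, `action_probs` is a distribution over the candidate views, so the agent's next move would be the view with the highest probability. A trained model would add learned projections and a recurrent or step-wise update of the memories.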
