Dynamic Attention Networks for Task Oriented Grounding

Abstract

To successfully perform tasks specified by natural language instructions, an artificial agent operating in a visual world needs to map words, concepts, and actions from the instruction to visual elements in its environment. This association is termed Task-Oriented Grounding. In this work, we propose a novel Dynamic Attention Network architecture for the efficient multi-modal fusion of text and visual representations, which can generate a robust definition of state for the policy learner. Our model assumes no prior knowledge from the visual and textual domains and is end-to-end trainable. For a 3D visual world where the observation changes continuously, the attention on the visual elements tends to be highly correlated from one time step to the next. We term this "Dynamic Attention". We show that Dynamic Attention helps in achieving grounding and also aids the policy learning objective. Since most practical robotic applications take place in the real world, where the observation space is continuous, our framework can be used as a generalized multi-modal fusion unit for robotic control through natural language. We show the effectiveness of using 1D convolution over the Gated Attention Hadamard product on the rate of convergence of the network. We demonstrate that the cell state of a Long Short-Term Memory (LSTM) network is a natural choice for modeling Dynamic Attention, and show through visualization that the generated attention is very close to how humans tend to focus on the environment.
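For readers unfamiliar with the Gated Attention Hadamard product that the abstract uses as its point of comparison, the following is a minimal sketch of one common formulation: the instruction embedding is projected to one sigmoid gate per visual channel, and each feature map is scaled element-wise by its gate. The abstract does not specify this formulation, so the function name `gated_attention`, the shapes, and the toy weights are illustrative assumptions; this is not the Dynamic Attention Network itself, nor its 1D-convolution fusion.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(feature_maps, text_embedding, weight, bias):
    """Scale each visual channel by a text-conditioned gate (hypothetical helper).

    feature_maps:   list of C channels, each an H x W nested list
    text_embedding: list of D floats summarizing the instruction
    weight, bias:   C x D matrix and length-C vector of an assumed linear layer
    """
    # One gate per channel: sigmoid(weight_row . text_embedding + bias)
    gates = [
        sigmoid(sum(w * t for w, t in zip(row, text_embedding)) + b)
        for row, b in zip(weight, bias)
    ]
    # Hadamard product: the gate is broadcast over every spatial position
    fused = [
        [[gate * value for value in row] for row in fmap]
        for gate, fmap in zip(gates, feature_maps)
    ]
    return fused, gates


# Toy inputs (assumed values): 2 channels of 2x2 features, 3-dim text embedding
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[-1.0, 0.5], [0.0, 2.0]],
]
text_embedding = [0.5, -1.0, 2.0]
weight = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
bias = [0.0, 0.0]

fused, gates = gated_attention(feature_maps, text_embedding, weight, bias)
```

Because every gate lies strictly between 0 and 1, this fusion can only attenuate a channel and treats all spatial positions within a channel identically.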
