Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters

Abstract

In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination guided only by a natural language instruction. To explore the environment and progress towards the target location, the agent must perform a series of low-level actions, such as rotating before stepping ahead. In this paper, we propose to exploit dynamic convolutional filters to encode the visual information and the lingual description in an efficient way. Unlike some previous works that abstract from the agent perspective and use high-level navigation spaces, we design a policy which decodes the information provided by dynamic convolution into a series of low-level, agent-friendly actions. Results show that our model exploiting dynamic filters performs better than other architectures with traditional convolution, setting a new state of the art for embodied VLN in the low-level action space. Additionally, we attempt to categorize recent work on VLN depending on their architectural choices and distinguish two main groups: we call them low-level actions and high-level actions models. To the best of our knowledge, we are the first to propose this analysis and categorization for VLN.
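To make the idea of dynamic convolutional filters concrete, the following is a minimal, dependency-free sketch and not the architecture of the paper. It assumes that the instruction has already been encoded into a fixed-size embedding, that a hypothetical learned projection (`weights`) maps this embedding to a bank of 1x1 filters, and that the filters are squashed with tanh and L2-normalized; the abstract specifies none of these details. The point it illustrates is only that the filters are computed from the instruction at run time rather than being fixed after training.

```python
import math
import random


def generate_filters(instr_emb, weights):
    """Map an instruction embedding to a bank of 1x1 filters.

    weights is a hypothetical learned projection of shape
    num_filters x channels x emb_dim (an assumption of this sketch).
    Returns num_filters filters, each a list of `channels` values
    with unit L2 norm.
    """
    filters = []
    for w_f in weights:
        raw = [math.tanh(sum(w * e for w, e in zip(row, instr_emb))) for row in w_f]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid division by zero
        filters.append([v / norm for v in raw])
    return filters


def dynamic_conv1x1(feature_map, filters):
    """Apply each generated 1x1 filter at every spatial location.

    feature_map is an H x W x C nested list of visual features.
    Returns num_filters response maps, each of size H x W.
    """
    return [
        [[sum(f_c * x_c for f_c, x_c in zip(f, pixel)) for pixel in row]
         for row in feature_map]
        for f in filters
    ]


# Toy usage with random numbers standing in for learned weights and features.
random.seed(0)
C, D, K, H, W = 4, 3, 2, 2, 3  # channels, embedding size, filters, height, width
weights = [[[random.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
           for _ in range(K)]
features = [[[random.random() for _ in range(C)] for _ in range(W)]
            for _ in range(H)]
instruction = [0.5, -0.2, 0.1]  # stand-in for an encoded instruction

filters = generate_filters(instruction, weights)
response = dynamic_conv1x1(features, filters)
```

In this sketch a different instruction embedding yields different filters, so the same visual features produce different response maps. A policy would then decode such response maps into low-level actions; that part is not sketched here.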
