Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation

Abstract

Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. In this paper, we strive for the creation of an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and the first Transformer-like architecture incorporating three different modalities: natural language, images, and discrete actions for the agent control. In particular, we adopt an early fusion strategy to merge lingual and visual information efficiently in our encoder. We then propose to refine the decoding phase with a late fusion extension between the agent's history of actions and the perception modalities. We experimentally validate our model on two datasets and two different action settings. PTA surpasses previous state-of-the-art architectures for low-level VLN on R2R and achieves the first place for both setups in the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act.
