Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

Abstract

Vision-Language Navigation (VLN) is a task where agents learn to navigate following natural language instructions. The key to this task is to perceive both the visual scene and natural language sequentially. Conventional approaches exploit the vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have neglected the rich semantic information contained in the environment (such as implicit navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, estimating the navigation progress, predicting the next orientation, and evaluating the trajectory consistency. As a result, these additional training signals help the agent to acquire knowledge of semantic representations in order to reason about its activity and build a thorough perception of the environment. Our experiments indicate that auxiliary reasoning tasks improve both the performance of the main task and the model generalizability by a large margin. Empirically, we demonstrate that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, being the best existing approach on the standard benchmark.
