$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

Abstract

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal distinct characteristics of each action demand. We use the learned navigation policy for executing sub-tasks sequentially to follow the navigation instruction. Extensive experiments show $A^2$Nav achieves promising ZS-VLN performance and even surpasses the supervised learning methods on R2R-Habitat and RxR-Habitat datasets.
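The two-stage structure described above (parse an instruction into action-specific sub-tasks, then execute them in order) can be illustrated with a minimal sketch. It is not the authors' implementation: the `SubTask` type, the `action | landmark` text format assumed for the language model's output, the function names, and the choice to stop at the first failed sub-task are all hypothetical and introduced only for illustration. The language model call and the learned navigation policy are left out; the policy is represented by any callable that reports whether a sub-task succeeded.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SubTask:
    """One action-specific object navigation sub-task (hypothetical type)."""
    action: str    # action demand, e.g. "proceed beyond"
    landmark: str  # object the agent must localize


def parse_llm_output(text: str) -> List[SubTask]:
    """Parse an already-decomposed instruction.

    Assumes (for illustration only) that the language model was prompted
    to emit one sub-task per line in the form "action | landmark".
    """
    subtasks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action, sep, landmark = line.partition("|")
        if not sep:
            raise ValueError(f"expected 'action | landmark', got: {line!r}")
        subtasks.append(SubTask(action.strip(), landmark.strip()))
    return subtasks


def follow_instruction(
    subtasks: List[SubTask],
    policy: Callable[[SubTask], bool],
) -> int:
    """Execute sub-tasks sequentially; return how many were completed.

    Stopping at the first failure is an assumption of this sketch,
    not something the abstract specifies.
    """
    completed = 0
    for subtask in subtasks:
        if not policy(subtask):
            break
        completed += 1
    return completed
```

In use, the text passed to `parse_llm_output` would come from the instruction parser, and `policy` would wrap the learned action-aware navigation policy; both are stand-ins here.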
