LangNav: Language as a Perceptual Representation for Navigation

Abstract

We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore several use cases of our language-based navigation (LangNav) approach on the R2R VLN benchmark: generating synthetic trajectories from a prompted language model (GPT-4) with which to finetune a smaller language model; domain transfer where we transfer a policy learned on one simulated environment (ALFRED) to another (more realistic) environment (R2R); and combining both vision- and language-based representations for VLN. Our approach is found to improve upon baselines that rely on visual features in settings where only a few expert trajectories (10-100) are available, demonstrating the potential of language as a perceptual representation for navigation.
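
To make the idea of language as a perceptual representation concrete, the following is a minimal sketch of how a panoramic view might be rendered as text and passed to an action selector. It is an illustration under stated assumptions, not the paper's implementation: the `ViewDescription` structure, the prompt template, and the `score_fn` interface are all hypothetical. In practice the captions and object lists would come from off-the-shelf vision models, and `score_fn` would be a finetuned language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ViewDescription:
    """Text for one direction of the panorama (hypothetical structure)."""
    heading: str                                       # e.g. "left", "front"
    caption: str                                       # from an image captioning model
    objects: List[str] = field(default_factory=list)   # from an object detector


def describe_view(index: int, view: ViewDescription) -> str:
    """Render one view as a single line of natural language."""
    text = f"[{index}] {view.heading}: {view.caption}"
    if view.objects:
        text += " (objects: " + ", ".join(view.objects) + ")"
    return text


def build_prompt(instruction: str, history: List[str],
                 views: List[ViewDescription]) -> str:
    """Combine instruction, trajectory history, and current view into one prompt.

    The template is an assumption made for illustration only.
    """
    lines = ["Instruction: " + instruction, "History:"]
    if history:
        lines += [f"  step {t}: {h}" for t, h in enumerate(history, start=1)]
    else:
        lines.append("  (none)")
    lines.append("Current view:")
    lines += ["  " + describe_view(i, v) for i, v in enumerate(views)]
    lines.append("Which view should the agent move toward?")
    return "\n".join(lines)


def select_action(prompt: str, views: List[ViewDescription],
                  score_fn: Callable[[str, str], float]) -> int:
    """Return the index of the view that score_fn rates highest.

    score_fn(prompt, candidate_text) stands in for a language model that
    scores each candidate action given the prompt.
    """
    scores = [score_fn(prompt, describe_view(i, v)) for i, v in enumerate(views)]
    return max(range(len(views)), key=scores.__getitem__)
```

Because the whole observation is plain text, the same prompt format could in principle be filled from different environments or from synthetic trajectories written by a language model, which is the property the use cases above rely on.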
