Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments

Abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI, requiring agents to navigate freely in unbounded 3D spaces guided solely by natural language instructions. This task introduces distinct challenges in multimodal comprehension, spatial reasoning, and decision-making. To address these challenges, we introduce Cog-GA, a generative agent founded on large language models (LLMs) and tailored for VLN-CE tasks. Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes. First, it constructs a cognitive map that integrates temporal, spatial, and semantic elements, thereby facilitating the development of spatial memory within LLMs. Second, Cog-GA employs a predictive mechanism for waypoints, strategically optimizing the exploration trajectory to maximize navigational efficiency. Each waypoint is accompanied by a dual-channel scene description that categorizes environmental cues into 'what' and 'where' streams, as the brain does. This segregation enhances the agent's attentional focus, enabling it to discern the spatial information pertinent to navigation. A reflective mechanism complements these strategies by capturing feedback from prior navigation experiences, facilitating continual learning and adaptive replanning. Extensive evaluations conducted on VLN-CE benchmarks validate Cog-GA's state-of-the-art performance and its ability to simulate human-like navigation behaviors. This research contributes significantly to the development of strategic and interpretable VLN-CE agents.
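
As a rough illustration of the cognitive map and dual-channel waypoint descriptions summarized in the abstract, the following minimal sketch stores each waypoint with a temporal element (step index), a spatial element (position), and semantic 'what' and 'where' descriptions, then serializes the map into text that an LLM planner could read. The names `Waypoint`, `CognitiveMap`, `add`, and `to_prompt`, the graph layout, and the prompt format are all assumptions made for illustration; the abstract does not specify the actual data structures, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Waypoint:
    """Hypothetical waypoint record with a dual-channel scene description."""
    step: int                      # temporal element: when it was observed
    position: Tuple[float, float]  # spatial element: (x, y), units assumed
    what: List[str]                # 'what' stream: objects in view
    where: str                     # 'where' stream: spatial layout cue


@dataclass
class CognitiveMap:
    """Illustrative graph memory (an assumption, not the paper's design)."""
    nodes: Dict[int, Waypoint] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def add(self, node_id: int, waypoint: Waypoint,
            came_from: Optional[int] = None) -> None:
        """Store a waypoint and, if given, the edge that led to it."""
        self.nodes[node_id] = waypoint
        if came_from is not None:
            self.edges.add((came_from, node_id))

    def to_prompt(self) -> str:
        """Serialize nodes, then edges, as plain text for an LLM prompt."""
        lines = []
        for node_id, wp in sorted(self.nodes.items()):
            lines.append(
                f"node {node_id} (step {wp.step}) at {wp.position}: "
                f"what={', '.join(wp.what)}; where={wp.where}"
            )
        for a, b in sorted(self.edges):
            lines.append(f"edge {a} -> {b}")
        return "\n".join(lines)
```

In use, the agent would call `add` each time it reaches or predicts a waypoint and pass the result of `to_prompt` to the language model along with the instruction. Keeping the 'what' and 'where' fields separate in the record mirrors the two streams described in the abstract.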
