Abstract
Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. Such a trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status, and makes the decision to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing accurate top-down metric trajectories given the agent's navigation history. Although the zero-shot performance of NavGPT on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models.
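The per-step decision described above (textual observation, navigation history, and explorable directions in; a reasoned action out) can be illustrated with a minimal sketch. This is not the NavGPT implementation: the prompt layout, the `Thought:` / `Action:` reply format, and the names `StepInput`, `build_prompt`, `parse_action`, and `decide` are all hypothetical, and the language model is abstracted as any callable that maps a prompt string to a reply string.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepInput:
    """Hypothetical container for the three textual inputs named in the abstract."""
    observation: str        # text description of the current visual observation
    history: List[str]      # text summaries of previous steps
    candidates: List[str]   # text descriptions of explorable directions


def build_prompt(instruction: str, step: StepInput) -> str:
    """Assemble one prompt from the instruction and the step inputs (assumed layout)."""
    lines = [f"Instruction: {instruction}", "History:"]
    lines += [f"  {i + 1}. {h}" for i, h in enumerate(step.history)] or ["  (none)"]
    lines += [f"Observation: {step.observation}", "Candidates:"]
    lines += [f"  {i}. {c}" for i, c in enumerate(step.candidates)]
    lines.append("Reply with 'Thought: ...' then 'Action: <candidate index or stop>'.")
    return "\n".join(lines)


def parse_action(reply: str, n_candidates: int) -> int:
    """Return the chosen candidate index, or -1 for stop or an unusable reply."""
    match = re.search(r"Action:\s*(\d+|stop)", reply, flags=re.IGNORECASE)
    if match is None or match.group(1).lower() == "stop":
        return -1
    index = int(match.group(1))
    return index if 0 <= index < n_candidates else -1


def decide(instruction: str, step: StepInput, llm: Callable[[str], str]) -> int:
    """One zero-shot decision: prompt the model, then parse its chosen action."""
    reply = llm(build_prompt(instruction, step))
    return parse_action(reply, len(step.candidates))
```

In this sketch, `llm` could be a thin wrapper around any chat-completion client. Treating out-of-range or unparseable replies as a stop is a simplification chosen here for brevity, not a behavior taken from the paper.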