Abstract
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference frameworks upon large language models (LLMs) for specific tasks; these frameworks differ substantially in their overall pipelines and fail to generalize across different types of goals. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the agent's observations into an online-maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage the LLM for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and the goal graph at each time instant and propose different strategies to generate the long-term exploration goal according to different matching states. When zero-matched, the agent first iteratively searches for a subgraph of the goal. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally, scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
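The stage selection described above can be illustrated with a minimal sketch. It is not the UniGoal implementation: the `Graph` container, the exact-label overlap score, the `perfect_threshold` parameter, the strategy names, and the way the blacklist is applied are all hypothetical simplifications, standing in for the graph matching and LLM-based reasoning the paper actually uses.

```python
from dataclasses import dataclass, field
from enum import Enum


class MatchState(Enum):
    ZERO = "zero"
    PARTIAL = "partial"
    PERFECT = "perfect"


@dataclass
class Graph:
    # Nodes are object labels; edges are (subject, relation, object) triples.
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def match_score(scene: Graph, goal: Graph, blacklist: frozenset = frozenset()) -> float:
    """Fraction of goal nodes and edges that also appear in the scene graph.

    Assumption: matching is exact label overlap, and blacklisted scene nodes
    (plus any edge touching them) are simply ignored.
    """
    scene_nodes = scene.nodes - blacklist
    scene_edges = {
        e for e in scene.edges
        if e[0] not in blacklist and e[2] not in blacklist
    }
    total = len(goal.nodes) + len(goal.edges)
    if total == 0:
        return 0.0
    hits = len(goal.nodes & scene_nodes) + len(goal.edges & scene_edges)
    return hits / total


def match_state(score: float, perfect_threshold: float = 1.0) -> MatchState:
    """Map a score to one of the three matching states (threshold is hypothetical)."""
    if score <= 0.0:
        return MatchState.ZERO
    if score >= perfect_threshold:
        return MatchState.PERFECT
    return MatchState.PARTIAL


def choose_strategy(state: MatchState) -> str:
    """Return a placeholder name for the strategy used in each state."""
    return {
        MatchState.ZERO: "search_goal_subgraph",
        MatchState.PARTIAL: "infer_goal_location",
        MatchState.PERFECT: "correct_scene_graph_and_verify_goal",
    }[state]


# Usage: a goal of "chair next to table" against a scene that only contains a table.
goal = Graph({"chair", "table"}, {("chair", "next to", "table")})
scene = Graph({"table"}, set())
state = match_state(match_score(scene, goal))
print(state, choose_strategy(state))
```

In this sketch the state is recomputed from scratch at every step, which mirrors the per-time-instant matching in the text; a blacklisted node lowers the score and can therefore push the agent back to an earlier stage.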