IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation

Abstract

Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate photo-realistic environments by following natural language prompts from humans. Recent studies address this task by constructing a semantic spatial map representation of the environment and then leveraging the strong reasoning ability of large language models to generate code that guides robot navigation. However, these methods face limitations in instance-level and attribute-level navigation tasks because they cannot distinguish different instances of the same object. To address this challenge, we propose a new method, the Instance-aware Visual Language Map (IVLMap), which empowers the robot with instance-level and attribute-level semantic mapping. The map is constructed autonomously by fusing RGBD video data collected by the robot agent with a specially designed natural language map indexing in the bird's-eye view. This indexing is both instance-level and attribute-level. In particular, when integrated with a large language model, IVLMap demonstrates the capability to i) transform natural language into navigation targets with instance and attribute information, enabling precise localization, and ii) accomplish zero-shot end-to-end navigation tasks based on natural language commands. Extensive navigation experiments are conducted. Simulation results illustrate that our method achieves an average improvement of 14.4% in navigation accuracy. Code and demo are released at https://ivlmap.github.io/.
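
To make the idea of instance-level and attribute-level indexing concrete, the following is a minimal illustrative sketch, not the IVLMap implementation. It assumes a toy map stored as a list of labelled instances and resolves a target such as "the second red chair" to a map coordinate. The names MapInstance and find_target, the single color attribute, and the convention that ordinals follow insertion order are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MapInstance:
    """One object instance in a toy bird's-eye-view map (hypothetical structure)."""
    category: str                  # object class, e.g. "chair"
    color: str                     # a single attribute label, e.g. "red"
    centroid: Tuple[float, float]  # (x, y) position in map coordinates


def find_target(
    instances: List[MapInstance],
    category: str,
    color: Optional[str] = None,
    ordinal: Optional[int] = None,
) -> Optional[MapInstance]:
    """Return the instance matching category, optional color, and 1-based ordinal.

    Assumption: ordinals count matches in the order they appear in `instances`.
    Returns None when nothing matches or the ordinal is out of range.
    """
    matches = [
        inst for inst in instances
        if inst.category == category and (color is None or inst.color == color)
    ]
    if not matches:
        return None
    if ordinal is None:
        return matches[0]
    if 1 <= ordinal <= len(matches):
        return matches[ordinal - 1]
    return None


# Toy map with three chairs and one table (made-up coordinates).
toy_map = [
    MapInstance("chair", "red", (1.0, 2.0)),
    MapInstance("chair", "black", (3.0, 0.5)),
    MapInstance("chair", "red", (4.0, 4.0)),
    MapInstance("table", "white", (2.0, 2.0)),
]

# "Navigate to the second red chair" resolves to one specific instance.
target = find_target(toy_map, "chair", color="red", ordinal=2)
print(target.centroid if target else "no match")
```

In this sketch a category-only query such as "the chair" is ambiguous and simply falls back to the first match, which is the kind of limitation that instance and attribute information is meant to remove.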
