OpenNav: Open-World Navigation with Multimodal Large Language Models

Abstract

Pre-trained large language models (LLMs) have demonstrated strong common-sense reasoning abilities, making them promising for robotic navigation and planning tasks. However, despite recent progress, bridging the gap between language descriptions and actual robot actions in the open world, beyond merely invoking limited predefined motion primitives, remains an open challenge. In this work, we aim to enable robots to interpret and decompose complex language instructions, ultimately synthesizing a sequence of trajectory points to complete diverse navigation tasks given open-set instructions and open-set objects. We observe that multimodal large language models (MLLMs) exhibit strong cross-modal understanding when processing free-form language instructions, demonstrating robust scene comprehension. More importantly, leveraging their code-generation capability, MLLMs can interact with vision-language perception models to generate compositional 2D bird's-eye-view value maps, effectively integrating semantic knowledge from MLLMs with spatial information from maps to reinforce the robot's spatial understanding. To evaluate our approach, we leverage large-scale autonomous vehicle datasets (AVDs) to validate our proposed zero-shot vision-language navigation framework in outdoor navigation tasks, demonstrating its capability to execute a diverse range of free-form natural language navigation instructions while maintaining robustness against object detection errors and linguistic ambiguities. Furthermore, we validate our system on a Husky robot in both indoor and outdoor scenes, demonstrating its real-world robustness and applicability. Supplementary videos are available at https://trailab.github.io/OpenNav-website/
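
To make the idea of a compositional bird's-eye-view value map more concrete, the following is a minimal illustrative sketch and not the implementation used in this work. The abstract does not specify how value maps are constructed or how trajectory points are extracted, so everything here is an assumption: the Gaussian attraction and repulsion terms, the greedy ascent, and the hypothetical names `gaussian_bump`, `compose_value_map`, and `greedy_waypoints`. Object locations are hard-coded grid cells, whereas in a real system they would presumably come from a vision-language perception model.

```python
import math


def gaussian_bump(shape, center, sigma):
    """Return a rows x cols grid holding a Gaussian centered at (row, col)."""
    rows, cols = shape
    return [
        [
            math.exp(-((r - center[0]) ** 2 + (c - center[1]) ** 2) / (2.0 * sigma ** 2))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def compose_value_map(shape, targets, obstacles, target_sigma=12.0, obstacle_sigma=3.0):
    """Assumed composition rule: add wide bumps at targets, subtract narrow bumps at obstacles."""
    rows, cols = shape
    value = [[0.0] * cols for _ in range(rows)]
    for sign, centers, sigma in ((1.0, targets, target_sigma), (-1.0, obstacles, obstacle_sigma)):
        for center in centers:
            bump = gaussian_bump(shape, center, sigma)
            for r in range(rows):
                for c in range(cols):
                    value[r][c] += sign * bump[r][c]
    return value


def greedy_waypoints(value, start, max_steps=200):
    """Step to the best strictly better 8-neighbour until none exists (may stop at a local maximum)."""
    rows, cols = len(value), len(value[0])
    r, c = start
    path = [(r, c)]
    for _ in range(max_steps):
        best_val, best_cell = value[r][c], (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and value[nr][nc] > best_val:
                    best_val, best_cell = value[nr][nc], (nr, nc)
        if best_cell == (r, c):
            break  # no neighbour improves on the current cell
        r, c = best_cell
        path.append(best_cell)
    return path


# Example: one target and one obstacle on a 30 x 30 grid (illustrative values only).
value_map = compose_value_map((30, 30), targets=[(5, 20)], obstacles=[(15, 12)])
waypoints = greedy_waypoints(value_map, start=(25, 5))
```

Greedy ascent is chosen here only for brevity; it can stall at a local maximum once obstacle terms are present, so a practical planner would more likely run a search-based or optimization-based method over the same map.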
