Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Abstract

Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-story environments. We provide code and trial video data at http://hovsg.github.io/.
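
The floor, room, and object hierarchy described above can be illustrated with a toy sketch. This is not the HOV-SG implementation: the `Node` class, the `resolve` function, and the 3-dimensional feature vectors are hypothetical stand-ins for real vision-language embeddings, and the sketch assumes the query has already been split into floor, room, and object parts and embedded in the same feature space.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Sequence


@dataclass
class Node:
    """Hypothetical graph node: a floor, room, or object with a feature vector."""
    name: str
    level: str                 # "root", "floor", "room", or "object"
    feature: Sequence[float]   # toy stand-in for an open-vocabulary embedding
    children: List["Node"] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_child(parent: Node, query: Sequence[float]) -> Node:
    """Pick the child whose feature is most similar to the query vector."""
    return max(parent.children, key=lambda c: cosine(c.feature, query))


def resolve(root: Node, floor_q, room_q, object_q) -> List[str]:
    """Descend floor -> room -> object, narrowing the search at each level."""
    floor = best_child(root, floor_q)
    room = best_child(floor, room_q)
    obj = best_child(room, object_q)
    return [floor.name, room.name, obj.name]


# Toy building with two floors (all names and features are made up).
building = Node("building", "root", [0, 0, 0], [
    Node("floor_0", "floor", [1, 0, 0], [
        Node("lobby", "room", [1, 1, 0], [
            Node("sofa", "object", [1, 0, 0]),
        ]),
    ]),
    Node("floor_1", "floor", [0, 1, 0], [
        Node("kitchen", "room", [1, 0, 0], [
            Node("fridge", "object", [0, 1, 0]),
            Node("sink", "object", [0, 0, 1]),
        ]),
        Node("office", "room", [0, 0, 1], [
            Node("desk", "object", [1, 0, 0]),
            Node("chair", "object", [0, 1, 0]),
        ]),
    ]),
])

path = resolve(building, [0.1, 0.9, 0.0], [0.9, 0.0, 0.1], [0.0, 0.2, 0.8])
print(path)
```

Each level only compares the query against the children of the node selected one level up, so the search is restricted to one floor and then one room instead of scanning every object in the building. The sketch assumes every selected node has at least one child; an empty floor or room would need explicit handling.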
