Abstract
For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc )