Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

Abstract

We study open-world 3D scene understanding, a family of tasks that requireagents to reason about their 3D environment with an open-set vocabulary andout-of-domain visual inputs - a critical skill for robots to operate in theunstructured 3D world. Towards this end, we propose Semantic Abstraction(SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3Dspatial capabilities, while maintaining their zero-shot robustness. We achievethis abstraction using relevancy maps extracted from CLIP, and learn 3D spatialand geometric reasoning skills on top of those abstractions in asemantic-agnostic manner. We demonstrate the usefulness of SemAbs on twoopen-world 3D scene understanding tasks: 1) completing partially observedobjects and 2) localizing hidden objects from language descriptions.Experiments show that SemAbs can generalize to novel vocabulary,materials/lighting, classes, and domains (i.e., real-world scans) from trainingon limited 3D synthetic data. Code and data will be available athttps://semantic-abstraction.cs.columbia.edu/

Quick Read (beta)

loading the full paper ...