Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding

Abstract

Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs). Existing strategies directly map Vision-Language Models from 2D pixels of rendered or captured views to 3D points, overlooking the inherent and expressible point cloud geometric structure. Geometrically similar or close regions can be exploited for bolstering point cloud understanding as they are likely to share semantic information. To this end, we introduce the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred Vision-Language Models. Our approach operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning. We benchmark our approach on three downstream tasks, including classification, part segmentation, and semantic segmentation, with a variety of datasets representing both synthetic/real-world and indoor/outdoor scenarios. Our approach achieves new state-of-the-art results in all benchmarks. Code and dataset are available at https://luigiriz.github.io/geoze-website/
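
The abstract does not spell out the aggregation procedure, so the following is only an illustrative sketch of the general idea it describes (geometrically close points are likely to share semantics, so their features can be aggregated iteratively without training). It is not the authors' method. The function names `knn_indices` and `aggregate_features`, the parameters `k` and `iterations`, the brute-force neighbour search, and the cosine-similarity weighting are all assumptions made for illustration; the per-point features stand in for VLM features already transferred to the 3D points.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbours (excluding the point itself) per 3D point."""
    neighbours = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(order[:k])
    return neighbours


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def aggregate_features(points, feats, k=2, iterations=3):
    """Hypothetical training-free smoothing of per-point features.

    Each point's feature is repeatedly replaced by a weighted average of its
    own feature and those of its k geometric neighbours. The weight of a
    neighbour is its (non-negative) feature similarity to the point, so the
    update uses both geometric and semantic cues.
    """
    neighbours = knn_indices(points, k)
    for _ in range(iterations):
        updated = []
        for i, f in enumerate(feats):
            acc = list(f)   # the point's own feature has weight 1
            total = 1.0
            for j in neighbours[i]:
                w = max(cosine(f, feats[j]), 0.0)
                total += w
                acc = [a + w * b for a, b in zip(acc, feats[j])]
            updated.append([a / total for a in acc])
        feats = updated
    return feats


# Two well-separated clusters; the third point of each has a noisy feature.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5)]
feats = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.4], [0.0, 1.0], [0.0, 1.0], [0.4, 0.6]]
smoothed = aggregate_features(points, feats, k=2, iterations=3)
```

In this toy setup the noisy features are pulled toward those of their geometric neighbours, while the two clusters stay separate because no neighbourhood crosses between them. A practical version would replace the brute-force search with a spatial index and would add the global stage that the abstract mentions.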
