Abstract
Recently, large-scale pre-trained models such as the Segment-Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP) have demonstrated remarkable success and revolutionized the field of computer vision. These vision foundation models capture knowledge from large-scale, broad data with their vast number of parameters, enabling them to perform zero-shot segmentation on previously unseen data without additional training. While they are competent in 2D tasks, their potential for enhancing 3D scene understanding remains relatively unexplored. To this end, we present a novel framework that adapts various foundation models to the 3D point cloud segmentation task. Our approach first predicts 2D semantic masks using different large vision models. We then project these mask predictions from the individual frames of RGB-D video sequences into 3D space. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that combines all the results via voting. We examine diverse scenarios, such as zero-shot learning and limited guidance from sparse 2D point labels, to assess the pros and cons of different vision foundation models. We evaluate our approach on the ScanNet dataset of 3D indoor scenes, and the results demonstrate the effectiveness of adopting general 2D foundation models for solving 3D point cloud segmentation tasks.