SAM-Guided Masked Token Prediction for 3D Scene Understanding

Abstract

Foundation models have significantly enhanced 2D task performance, and recentworks like Bridge3D have successfully applied these models to improve 3D sceneunderstanding through knowledge distillation, marking considerableadvancements. Nonetheless, challenges such as the misalignment between 2D and3D representations and the persistent long-tail distribution in 3D datasetsstill restrict the effectiveness of knowledge distillation from 2D to 3D usingfoundation models. To tackle these issues, we introduce a novel SAM-guidedtokenization method that seamlessly aligns 3D transformer structures withregion-level knowledge distillation, replacing the traditional KNN-basedtokenization techniques. Additionally, we implement a group-balancedre-weighting strategy to effectively address the long-tail problem in knowledgedistillation. Furthermore, inspired by the recent success of masked featureprediction, our framework incorporates a two-stage masked token predictionprocess in which the student model predicts both the global embeddings and thetoken-wise local embeddings derived from the teacher models trained in thefirst stage. Our methodology has been validated across multiple datasets,including SUN RGB-D, ScanNet, and S3DIS, for tasks like 3D object detection andsemantic segmentation. The results demonstrate significant improvements overcurrent State-of-the-art self-supervised methods, establishing new benchmarksin this field.

Quick Read (beta)

loading the full paper ...