Abstract
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. Because 3D training data is generally less abundant than 2D training images, SegDINO3D is designed to fully leverage 2D representations from a pre-trained 2D detection model, including both image-level and object-level features, to improve its 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then applies a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from the 3D queries to 2D object queries obtained from the 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of the 2D images, avoiding the need to keep thousands of image feature maps in memory while faithfully preserving the knowledge of the pre-trained 2D model. The introduction of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively.
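
To make the encoder-stage idea of retrieving 2D image features for each 3D point concrete, the following is a minimal illustrative sketch, not the SegDINO3D implementation. It assumes pinhole cameras with known intrinsics and world-to-camera poses, nearest-pixel lookup, simple averaging over the views in which a point projects inside the image, and no occlusion test. The function name `lift_2d_features` and all parameter names are hypothetical.

```python
import numpy as np


def lift_2d_features(points, feat_maps, intrinsics, world_to_cam):
    """Average per-view 2D features onto 3D points (illustrative sketch).

    points:       (N, 3) world-space coordinates
    feat_maps:    (V, H, W, C) one feature map per view
    intrinsics:   (V, 3, 3) pinhole matrices at feature-map resolution
    world_to_cam: (V, 4, 4) rigid world-to-camera transforms
    Returns (N, C) averaged features and an (N,) mask of points seen at least once.
    """
    n_views, height, width, channels = feat_maps.shape
    n_points = points.shape[0]
    summed = np.zeros((n_points, channels), dtype=np.float64)
    counts = np.zeros(n_points, dtype=np.int64)
    homogeneous = np.concatenate([points, np.ones((n_points, 1))], axis=1)

    for v in range(n_views):
        # Transform the points into the camera frame of view v.
        cam = (world_to_cam[v] @ homogeneous.T).T[:, :3]
        depth = cam[:, 2]
        in_front = depth > 1e-6
        # Avoid dividing by zero or negative depth; such points are masked out below.
        safe_depth = np.where(in_front, depth, 1.0)
        pix = (intrinsics[v] @ cam.T).T
        col = np.rint(pix[:, 0] / safe_depth).astype(np.int64)
        row = np.rint(pix[:, 1] / safe_depth).astype(np.int64)
        visible = (
            in_front & (col >= 0) & (col < width) & (row >= 0) & (row < height)
        )
        # Assumption: no depth-based occlusion check is performed here.
        summed[visible] += feat_maps[v, row[visible], col[visible]]
        counts[visible] += 1

    seen = counts > 0
    summed[seen] /= counts[seen, None]
    return summed, seen
```

Points that no view observes keep a zero feature vector and are flagged by the returned mask, so a downstream 3D encoder could treat them separately. A practical pipeline would likely also compare projected depth against a depth map to reject occluded points.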