SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features

  • 2025-09-19 15:41:10
  • Jinyuan Qu, Hongyang Li, Xingyu Chen, Shilong Liu, Yukai Shi, Tianhe Ren, Ruitao Jing, Lei Zhang

Abstract

In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. Because 3D training data is generally scarcer than 2D training images, SegDINO3D is designed to fully leverage 2D representations from a pre-trained 2D detection model, including both image-level and object-level features, to improve the 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in memory while faithfully preserving the knowledge of the pre-trained 2D model. The introduction of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.
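
The encoder-stage step of retrieving 2D image features for each 3D point can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single view, a pinhole camera with known intrinsics `K` and a world-to-camera matrix, and nearest-pixel lookup. The function name `gather_point_features` and all parameter names are hypothetical, and how features from several views are combined is left out because the abstract does not specify it.

```python
import numpy as np

def gather_point_features(points, feat_map, K, world_to_cam):
    """Look up one 2D feature vector per 3D point (illustrative sketch only).

    points:       (N, 3) world-space coordinates
    feat_map:     (H, W, C) image feature map from a 2D model
    K:            (3, 3) pinhole intrinsics (assumed known)
    world_to_cam: (4, 4) extrinsic matrix (assumed known)
    Returns (N, C) features and an (N,) boolean visibility mask.
    """
    n = points.shape[0]
    h, w, c = feat_map.shape

    # Transform points into the camera frame using homogeneous coordinates.
    homo = np.concatenate([points, np.ones((n, 1))], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]

    # Points at or behind the camera plane cannot be seen in this view.
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero

    # Perspective projection to pixel coordinates, rounded to the nearest pixel.
    pix = (K @ cam.T).T
    u = np.round(pix[:, 0] / safe_z).astype(int)
    v = np.round(pix[:, 1] / safe_z).astype(int)

    # Keep only points that land inside the feature map.
    visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Points with no valid projection keep an all-zero feature.
    feats = np.zeros((n, c), dtype=feat_map.dtype)
    feats[visible] = feat_map[v[visible], u[visible]]
    return feats, visible
```

In this sketch, points that fall outside the image or behind the camera simply receive zero features; a real pipeline would also need occlusion handling (for example, a depth check), which is omitted here.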

 
