XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Abstract

Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.
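
As a rough illustration of what mask-level alignment can look like, the toy sketch below pools per-point features inside a binary mask and matches the pooled vector against text embeddings by cosine similarity. It is not the XMask3D implementation: the function names (`mask_pool`, `cosine`, `classify_mask`), the 2-dimensional features, and the label embeddings are all hypothetical and chosen only to keep the example self-contained.

```python
import math

def mask_pool(features, mask):
    """Average the feature vectors of the points selected by a binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no points")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_mask(features, mask, text_embeddings):
    """Return the label whose text embedding is closest to the pooled mask feature."""
    pooled = mask_pool(features, mask)
    return max(text_embeddings, key=lambda name: cosine(pooled, text_embeddings[name]))

# Hypothetical toy data: four 3D points with 2-dimensional features,
# two masks, and two made-up text embeddings.
point_features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
mask_a = [1, 1, 0, 0]
mask_b = [0, 0, 1, 1]
text_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}

label_a = classify_mask(point_features, mask_a, text_embeddings)
label_b = classify_mask(point_features, mask_b, text_embeddings)
```

In this sketch a whole mask is compared with the text space as one pooled vector, in contrast to the global, scene-level feature alignment that the abstract describes as too coarse for fine-grained boundaries.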
