Abstract
We introduce Gaga, a framework that reconstructs and segments open-world 3Dscenes by leveraging inconsistent 2D masks predicted by zero-shotclass-agnostic segmentation models. Contrasted to prior 3D scene segmentationapproaches that rely on video object tracking or contrastive learning methods,Gaga utilizes spatial information and effectively associates object masksacross diverse camera poses through a novel 3D-aware memory bank. Byeliminating the assumption of continuous view changes in training images, Gagademonstrates robustness to variations in camera poses, particularly beneficialfor sparsely sampled images, ensuring precise mask label consistency.Furthermore, Gaga accommodates 2D segmentation masks from diverse sources anddemonstrates robust performance with different open-world zero-shotclass-agnostic segmentation models, significantly enhancing its versatility.Extensive qualitative and quantitative evaluations demonstrate that Gagaperforms favorably against state-of-the-art methods, emphasizing its potentialfor real-world applications such as 3D scene understanding and manipulation.