Abstract
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly generalized 3D perception model is urgently needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFMs) have revolutionized the field of 2D computer vision with superior performance, which makes the use of VFMs to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow to be applied in practical embodied tasks. In this paper, we aim to leverage the Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames, so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM as 3D-aware queries, which are then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefiting from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views by efficient matrix operations, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show that our method achieves leading performance even compared with offline methods. Our method also demonstrates strong generalization ability in several zero-shot dataset transfer experiments and shows great potential in open-vocabulary and data-efficient settings. Code and demo are available at https://xuxw98.github.io/ESAM/, with only one RTX 3090 GPU required for training and evaluation.
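
To illustrate why a query representation makes cross-view matching cheap, the following is a minimal sketch in Python with NumPy. It is not the method's actual implementation: the abstract does not specify the similarity measure or the assignment rule, so cosine similarity, greedy one-to-one assignment, the `threshold` value, and the function name `match_queries` are all assumptions made here for illustration. The point is only that, once each 3D mask is summarized by a fixed-length query vector, the full similarity matrix between views reduces to one matrix product.

```python
import numpy as np


def match_queries(prev_q, curr_q, threshold=0.5):
    """Match current-frame mask queries to previously seen instances.

    Hypothetical helper for illustration only.
    prev_q: (M, D) array, one query vector per instance seen so far.
    curr_q: (N, D) array, one query vector per mask in the current frame.
    Returns a list of length N holding the matched previous index,
    or -1 when the mask is treated as a new instance.
    """
    prev_q = np.asarray(prev_q, dtype=float)
    curr_q = np.asarray(curr_q, dtype=float)
    if len(prev_q) == 0 or len(curr_q) == 0:
        return [-1] * len(curr_q)

    # L2-normalize rows so the matrix product yields cosine similarity
    # (assumed measure; the guard avoids division by zero).
    prev_n = prev_q / np.maximum(np.linalg.norm(prev_q, axis=1, keepdims=True), 1e-12)
    curr_n = curr_q / np.maximum(np.linalg.norm(curr_q, axis=1, keepdims=True), 1e-12)

    # One matrix product gives all N x M pairwise similarities.
    sim = curr_n @ prev_n.T

    # Greedy one-to-one assignment (assumed rule): repeatedly take the
    # best remaining pair until nothing exceeds the threshold.
    matches = [-1] * len(curr_q)
    while True:
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[i, j] < threshold:
            break
        matches[i] = int(j)
        sim[i, :] = -np.inf  # current mask i is now assigned
        sim[:, j] = -np.inf  # previous instance j is now taken
    return matches


# Usage: two known instances, three masks in the new frame.
prev = np.array([[1.0, 0.0], [0.0, 1.0]])
curr = np.array([[0.0, 2.0], [-1.0, 0.0], [0.9, 0.1]])
result = match_queries(prev, curr)
```

In this toy example the first new mask aligns with the second known instance, the third aligns with the first, and the second falls below the threshold and would be registered as a new instance. A real system would likely combine several cues and a more careful assignment step; the sketch only shows the matrix-operation core.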