Abstract
Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.