GRES: Generalized Referring Expression Segmentation

Abstract

Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object. Multi-target and no-target expressions are not considered, which limits the usage of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends classic RES to allow expressions to refer to an arbitrary number of target objects. To this end, we construct the first large-scale GRES dataset, called gRefCOCO, which contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be well compatible with RES, facilitating extensive experiments to study the performance gap of existing RES methods on the GRES task. In the experimental study, we find that one of the big challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline, ReLA, that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed approach ReLA achieves new state-of-the-art performance on both the newly proposed GRES task and the classic RES task. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GRES.
