GS: Generative Segmentation via Label Diffusion

  • 2025-08-27 16:28:15
  • Yuhao Chen, Shubin Chen, Liang Lin, Guangrun Wang

Abstract

Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions. Traditional methods approach this as a discriminative problem, assigning each pixel to foreground or background based on semantic alignment. Recently, diffusion models have been introduced to this domain, but existing approaches remain image-centric: they either (i) use image diffusion models as visual feature extractors, (ii) synthesize segmentation data via image generation to train discriminative models, or (iii) perform diffusion inversion to extract attention cues from pre-trained image diffusion models, thereby treating segmentation as an auxiliary process. In this paper, we propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Instead of generating images conditioned on label maps and text, GS reverses the generative process: it directly generates segmentation masks from noise, conditioned on both the input image and the accompanying language description. This paradigm makes label generation the primary modeling target, enabling end-to-end training with explicit control over spatial and semantic fidelity. To demonstrate the effectiveness of our approach, we evaluate GS on Panoptic Narrative Grounding (PNG), a representative and challenging benchmark for multimodal segmentation that requires panoptic-level reasoning guided by narrative captions. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.
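
To make the idea of "generating segmentation masks from noise, conditioned on the image and text" concrete, here is a minimal sketch of a standard DDPM-style reverse (ancestral) sampling loop applied to a label map instead of an image. It is not the GS architecture or training recipe, which the abstract does not specify. The linear noise schedule, the continuous -1/+1 mask encoding, and every name below (`make_schedule`, `toy_noise_predictor`, `sample_mask`) are assumptions made for illustration. In particular, `toy_noise_predictor` is a hypothetical stand-in for a learned network: it derives its mask guess from a trivial brightness rule rather than from any real vision-language model.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    span = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * i / span
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, image, text_embedding, alpha_bar_t):
    """Hypothetical stand-in for a learned eps(x_t, image, text, t).

    It guesses a clean mask from the conditioning with a toy rule
    (pixel brighter than the mean of the text embedding -> +1, else -1)
    and returns the noise that would explain x_t given that guess.
    """
    threshold = sum(text_embedding) / len(text_embedding)
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [[(x - scale * (1.0 if p > threshold else -1.0)) / sigma
             for x, p in zip(x_row, p_row)]
            for x_row, p_row in zip(x_t, image)]


def sample_mask(image, text_embedding, num_steps=50, seed=0):
    """Run the reverse diffusion chain on the label map, not the image."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise with the same shape as the image.
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in image]
    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, image, text_embedding, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # No fresh noise is added on the final step.
        noise_std = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [[(xv - coef * ev) / math.sqrt(alphas[t])
              + noise_std * rng.gauss(0.0, 1.0)
              for xv, ev in zip(x_row, e_row)]
             for x_row, e_row in zip(x, eps)]
    # Threshold the continuous label map into a binary mask.
    return [[v > 0.0 for v in row] for row in x]


if __name__ == "__main__":
    image = [[0.1, 0.9], [0.8, 0.2]]   # toy grayscale "image"
    text_embedding = [0.4, 0.6]        # toy "text" features
    print(sample_mask(image, text_embedding))
```

The point of the sketch is the direction of the data flow: the image and text enter only as conditioning inputs to the noise predictor, while the quantity being denoised step by step is the label map itself. In a real system the toy predictor would be replaced by a trained network.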

 
