Abstract
We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided. Code and demo are available at https://github.com/isl-org/lang-seg.
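To make the pixel-to-text alignment concrete, the following is a minimal NumPy sketch of the idea and not the released implementation. It assumes pixel and text embeddings share a dimension C and are compared by cosine similarity. The function names, the softmax cross-entropy form of the contrastive objective, and the temperature value are illustrative assumptions; the one-hot "text embeddings" are toy stand-ins for real encoder outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def segment_by_text(pixel_emb, text_emb):
    """Assign each pixel the label whose text embedding is most similar.

    pixel_emb: (H, W, C) dense per-pixel embeddings.
    text_emb:  (N, C) one embedding per input label.
    Returns an (H, W) array of label indices in [0, N).
    """
    scores = l2_normalize(pixel_emb) @ l2_normalize(text_emb).T  # (H, W, N)
    return scores.argmax(axis=-1)


def pixel_text_loss(pixel_emb, text_emb, target, temperature=0.07):
    """Per-pixel softmax cross-entropy over labels (an assumed loss form).

    target: (H, W) integer array holding the ground-truth label per pixel.
    temperature: hypothetical scaling value chosen for illustration only.
    """
    logits = (l2_normalize(pixel_emb) @ l2_normalize(text_emb).T) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    h, w = target.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    return -log_probs[rows, cols, target].mean()


# Toy demo: one-hot vectors stand in for the text embeddings of three labels.
rng = np.random.default_rng(0)
labels = ["grass", "building", "sky"]
text_emb = np.eye(len(labels), 8)
target = rng.integers(0, len(labels), size=(4, 5))
# Pixel embeddings that sit close to the embedding of their true label.
pixel_emb = text_emb[target] + 0.01 * rng.standard_normal((4, 5, 8))

pred = segment_by_text(pixel_emb, text_emb)
loss = pixel_text_loss(pixel_emb, text_emb, target)
```

Because the label set enters only through `text_emb`, a new category can be introduced at test time by appending one more row to that matrix; the pixel embeddings and the image encoder are left unchanged.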