Abstract
Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., a photo of <class>, a sketch of a <class>, etc.) for constructing class-wise averaged text embeddings, acting as a classifier. In this paper, we challenge this status quo and investigate the impact of templates for OVSS. Empirically, we observe that for each class, there exist single-template classifiers significantly outperforming the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a "free lunch" to systematically improve OVSS without labels and additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates can generalize well from one dataset to others sharing the same semantic categories, yet exhibiting distribution shifts. Additionally, we obtain satisfactory improvements under a low-data regime, where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS.
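
To make the entropy-based selection step concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the softmax outputs of T single-template classifiers over N unlabeled pixels and K classes are already available as an array, and it assumes that the "class-wise prediction entropy" of a template for class k is the mean entropy over the pixels that this template assigns to class k. The function name `select_class_experts` and the array layout are hypothetical, and the fusion step is not sketched because the abstract does not describe it.

```python
import numpy as np


def select_class_experts(probs, eps=1e-12):
    """Pick, for each class, the template with the lowest class-wise entropy.

    probs: array of shape (T, N, K) holding softmax probabilities from
           T single-template classifiers over N pixels and K classes
           (hypothetical layout, chosen for this illustration).
    Returns (experts, class_entropy), where experts has shape (K,) and
    class_entropy has shape (T, K).
    """
    T, N, K = probs.shape
    # Per-pixel Shannon entropy of each template's prediction: shape (T, N).
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    # Hard prediction of each template at each pixel: shape (T, N).
    pred = probs.argmax(axis=-1)

    # Assumption: class-wise entropy is the mean entropy over the pixels
    # that template t predicts as class k. Templates that never predict
    # class k keep an infinite score and are therefore never preferred.
    class_entropy = np.full((T, K), np.inf)
    for t in range(T):
        for k in range(K):
            mask = pred[t] == k
            if mask.any():
                class_entropy[t, k] = entropy[t, mask].mean()

    # Lowest entropy wins; if no template predicts class k, index 0 is returned.
    experts = class_entropy.argmin(axis=0)
    return experts, class_entropy
```

Under these assumptions, `experts[k]` is the index of the template treated as the class-expert for class k. No labels are used at any point, which matches the unlabeled, training-free setting described above.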