ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

Abstract

Pixel-level vision tasks, such as semantic segmentation, require extensive and high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self-training techniques. Meanwhile, the advent of foundational segmentation models pre-trained on massive data has shown the potential to generalize across domains effectively. This work explores whether a foundational segmentation model can address label scarcity in pixel-level vision tasks by serving as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine-tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM-generated masks as supervision, we propose ConformalSAM, a novel SSSS framework that first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data so that only high-confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM reliably exploits the strong capability of the foundational segmentation model, which benefits early-stage learning, while a subsequent self-reliance training strategy mitigates overfitting to SEEM-generated masks in the later training stage. Our experiments demonstrate that, on three standard benchmarks of SSSS, ConformalSAM achieves superior performance compared to recent SSSS methods and helps boost the performance of those methods as a plug-in.
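
The calibrate-then-filter idea described in the abstract can be illustrated with generic split conformal prediction. The sketch below is not the paper's implementation: the nonconformity score (one minus the probability of the true class), the rule of keeping a pixel only when its conformal set contains exactly one class, the function names, and the ignore value 255 are all assumptions made for illustration.

```python
import math
import numpy as np

IGNORE_INDEX = 255  # hypothetical label for pixels excluded from supervision


def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration on labeled pixels.

    cal_probs:  (n, K) class probabilities predicted for n labeled pixels.
    cal_labels: (n,) ground-truth class indices.
    Returns the score threshold q_hat for miscoverage level alpha.
    """
    n = cal_labels.shape[0]
    # Assumed nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration pixels for this alpha: every class stays in the set.
        return 1.0
    return float(np.sort(scores)[k - 1])


def filter_pseudo_labels(probs, q_hat):
    """Keep a pseudo-label only where the conformal set is a single class.

    probs: (H, W, K) class probabilities for an unlabeled image.
    Returns an (H, W) label map with uncertain pixels set to IGNORE_INDEX.
    """
    in_set = (1.0 - probs) <= q_hat      # class k is in the set if its score <= q_hat
    set_size = in_set.sum(axis=-1)
    labels = probs.argmax(axis=-1)       # the only member of a singleton set is the argmax
    return np.where(set_size == 1, labels, IGNORE_INDEX)
```

In a semi-supervised loop, the filtered map would serve as the target for unlabeled images, with the loss skipping pixels marked `IGNORE_INDEX`.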
