Abstract
Vision-Language Models (VLMs) have shown impressive performance in vision tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt tuning techniques, including textual, visual, and multimodal prompting, offer efficient alternatives by leveraging learnable prompts. However, their application to Vision-Language Segmentation Models (VLSMs) and their evaluation under significant domain shifts remain unexplored. This work presents an open-source benchmarking framework, TuneVLSeg, that integrates various unimodal and multimodal prompt tuning techniques into VLSMs, making prompt tuning usable for downstream segmentation datasets with any number of classes. TuneVLSeg includes $6$ prompt tuning strategies at various prompt depths, used in $2$ VLSMs, for a total of $8$ different combinations. We evaluate these prompt tuning strategies on $8$ diverse medical datasets, including $3$ radiology datasets (breast tumor, echocardiograph, chest X-ray pathologies) and $5$ non-radiology datasets (polyp, ulcer, skin cancer), and on two natural-domain segmentation datasets. Our study found that textual prompt tuning struggles under significant domain shifts from natural-domain images to medical data. Furthermore, visual prompt tuning, with fewer hyperparameters than multimodal prompt tuning, often achieves performance competitive with multimodal approaches, making it a valuable first attempt. Our work advances the understanding and applicability of different prompt tuning techniques for robust domain-specific segmentation. The source code is available at https://github.com/naamiinepal/tunevlseg.