Abstract
Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases, which are crucial in real-world applications where users express the same intent in varied ways, remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing, even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.