Abstract
Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models, trained with a contrastive alignment objective, can be unstable under small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.