Abstract
Owing to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we observe that adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). The framework incorporates two components, the Attention Refinement module and the Attention-based Model Constraint module, with the goal of preserving the generalization of the CLIP model while enhancing its adversarial robustness. The Attention Refinement module aligns the text-guided attention obtained from the target model on adversarial examples with the text-guided attention obtained from the original model on clean examples; this alignment enhances the model's robustness. The Attention-based Model Constraint module acquires text-guided attention from both the target and original models on clean examples; its objective is to maintain model performance on clean samples while enhancing overall robustness. Experiments validate that our method yields a 9.58% improvement in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets. Our code is available at https://github.com/zhyblue424/TGA-ZSR.
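
The two alignment objectives described above can be illustrated with a minimal sketch. The abstract does not specify how text-guided attention is computed or which distance is used, so everything below is an assumption made for illustration only: attention is taken to be the min-max-normalized dot product between per-patch image features and a text embedding, the alignment terms are plain L2 distances, and the function names (`text_guided_attention`, `attention_losses`) are hypothetical rather than taken from the released code.

```python
import math


def text_guided_attention(patch_features, text_embedding):
    """Hypothetical attention map: one score per image patch.

    Assumption: score = dot(patch feature, text embedding),
    min-max normalized to [0, 1]. The abstract does not define this.
    """
    scores = [
        sum(p * t for p, t in zip(patch, text_embedding))
        for patch in patch_features
    ]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All patches score the same: return a flat map.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def l2_distance(a, b):
    """Euclidean distance between two equally sized attention maps."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def attention_losses(target_adv, target_clean, original_clean, text_embedding):
    """Return (refinement, constraint) alignment terms.

    refinement: target model on adversarial input vs. original model on clean input.
    constraint: target model on clean input vs. original model on clean input.
    """
    reference = text_guided_attention(original_clean, text_embedding)
    refinement = l2_distance(
        text_guided_attention(target_adv, text_embedding), reference
    )
    constraint = l2_distance(
        text_guided_attention(target_clean, text_embedding), reference
    )
    return refinement, constraint


# Toy example: three patches with 2-dimensional features (made-up values).
text = [1.0, 0.0]
original_clean = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
target_clean = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # unchanged on clean input
target_adv = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]    # attention shifted by the attack

refinement, constraint = attention_losses(
    target_adv, target_clean, original_clean, text
)
print(refinement, constraint)
```

In this toy setting the constraint term is zero because the target and original models agree on the clean input, while the refinement term is positive because the adversarial input shifts the attention to a different patch. In training, such terms would presumably be added to a classification loss with weighting coefficients, but those details are not given in the abstract.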