Abstract
Person re-identification (re-id) models are vital in security surveillance systems, which calls for transferable adversarial attacks to explore their vulnerabilities. Recently, vision-language model (VLM) based attacks have shown superior transferability by attacking the generalized image and textual features of VLMs, but they lack comprehensive feature disruption because they overemphasize discriminative semantics in the integral representation. In this paper, we introduce the Attribute-aware Prompt Attack (AP-Attack), a novel method that leverages the VLM's image-text alignment capability to explicitly disrupt fine-grained semantic features of pedestrian images by destroying attribute-specific textual embeddings. To obtain personalized textual descriptions for individual attributes, textual inversion networks are designed to map pedestrian images to pseudo tokens that represent semantic embeddings, trained in a contrastive learning manner with images and a predefined prompt template that explicitly describes the pedestrian attributes. The inverted benign and adversarial fine-grained textual semantics help the attacker conduct thorough disruptions effectively, enhancing the transferability of adversarial examples. Extensive experiments show that AP-Attack achieves state-of-the-art transferability, significantly outperforming previous methods by 22.9% on mean Drop Rate in cross-model&dataset attack scenarios.