Abstract
Vision-language pre-training (VLP) models, trained on large-scale image-text pairs, have become widely used across a variety of downstream vision-and-language (V+L) tasks. This widespread adoption raises concerns about their vulnerability to adversarial attacks. Non-universal adversarial attacks, while effective, are often impractical for real-time online applications because of their high computational cost per data instance. Recently, universal adversarial perturbations (UAPs) have been introduced as a solution, but existing generator-based UAP methods are significantly time-consuming. To overcome this limitation, we propose a direct optimization-based UAP approach, termed DO-UAP, which significantly reduces resource consumption while maintaining high attack performance. In particular, we explore the necessity of multimodal loss design and introduce a useful data augmentation strategy. Extensive experiments conducted on three benchmark VLP datasets, six popular VLP models, and three classical downstream tasks demonstrate the efficiency and effectiveness of DO-UAP. Specifically, our approach reduces time consumption 23-fold while achieving better attack performance.