Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis

Abstract

Assessing the robustness of multimodal models against adversarial examples is important for the safety of their users. We craft L0-norm perturbation attacks on the preprocessed input images. We launch them in a black-box setup against four multimodal models and two unimodal DNNs, considering both targeted and untargeted misclassification. Our attacks perturb less than 0.04% of the image area and use different spatial arrangements of the perturbed pixels: sparse positioning and pixels arranged in different contiguous shapes (row, column, diagonal, and patch). To the best of our knowledge, we are the first to assess the robustness of three state-of-the-art multimodal models (ALIGN, AltCLIP, GroupViT) against different sparse and contiguous pixel distribution perturbations. Our results indicate that unimodal DNNs are more robust than multimodal models. Furthermore, models that use a CNN-based image encoder are more vulnerable than models with a ViT: for untargeted attacks, we obtain a 99% success rate by perturbing less than 0.02% of the image area.
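
The pixel layouts named in the abstract can be illustrated with a minimal sketch. This is not the authors' attack code: the helper names, the 224x224 image size, and the 16-pixel budget are assumptions chosen for illustration. The sketch only selects which pixel positions would be perturbed under each layout. It does not model how a black-box attack searches for positions or pixel values.

```python
import math
import random


def pixel_positions(layout, k, height, width, rng):
    """Return k distinct (row, col) positions for one layout (hypothetical helper)."""
    if layout == "sparse":
        # k distinct positions scattered anywhere in the image
        flat = rng.sample(range(height * width), k)
        return [divmod(i, width) for i in flat]
    if layout == "row":
        # k horizontally adjacent pixels
        r = rng.randrange(height)
        c0 = rng.randrange(width - k + 1)
        return [(r, c0 + i) for i in range(k)]
    if layout == "column":
        # k vertically adjacent pixels
        r0 = rng.randrange(height - k + 1)
        c = rng.randrange(width)
        return [(r0 + i, c) for i in range(k)]
    if layout == "diagonal":
        # k pixels along a top-left to bottom-right diagonal
        r0 = rng.randrange(height - k + 1)
        c0 = rng.randrange(width - k + 1)
        return [(r0 + i, c0 + i) for i in range(k)]
    if layout == "patch":
        # square patch; this sketch assumes k is a perfect square
        side = math.isqrt(k)
        if side * side != k:
            raise ValueError("patch layout assumes k is a perfect square")
        r0 = rng.randrange(height - side + 1)
        c0 = rng.randrange(width - side + 1)
        return [(r0 + i, c0 + j) for i in range(side) for j in range(side)]
    raise ValueError(f"unknown layout: {layout}")


def perturbed_fraction(k, height, width):
    """Fraction of the image area touched by an L0 budget of k pixels."""
    return k / (height * width)


rng = random.Random(0)
H, W, K = 224, 224, 16  # assumed image size and pixel budget, not from the paper
for layout in ("sparse", "row", "column", "diagonal", "patch"):
    positions = pixel_positions(layout, K, H, W, rng)
    print(layout, positions[:3], "...")
print(f"{100 * perturbed_fraction(K, H, W):.4f}% of the image area")
```

Under these assumed values, 16 of 50,176 pixels is roughly 0.032% of the image, which is within the sub-0.04% budget the abstract describes. The L0 norm of the perturbation is the number of changed pixels, so every layout here has the same L0 budget and differs only in spatial arrangement.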
