Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning

Abstract

Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled to output some randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually similar adversarial examples with randomly targeted captions or keywords, and the adversarial examples can be made highly transferable to other image captioning systems. Consequently, our approach leads to new robustness implications of neural image captioning and novel insights in visual language grounding.
