Sparse Black-box Video Attack with Reinforcement Learning

Abstract

Adversarial attacks on video recognition models have been explored recently. However, most existing works treat each video frame equally and ignore the temporal interactions between frames. To overcome this drawback, a few methods select key frames and then perform attacks based on them. Unfortunately, their selection strategy is independent of the attacking step, so the resulting performance is limited. In this paper, we attack the video recognition task in the black-box setting. Unlike prior work, we argue that the frame selection phase is closely tied to the attacking phase: the key frames should be adjusted according to the feedback from attacking the threat models. Based on this idea, we formulate black-box video attacks within the Reinforcement Learning (RL) framework. Specifically, the environment in RL is set as the threat models, and the agent in RL performs frame selection and video attacking simultaneously. By continuously querying the threat models and receiving the feedback of predicted probabilities (reward), the agent adjusts its frame selection strategy and performs attacks (action). Step by step, the optimal key frames are selected and the smallest adversarial perturbations are achieved. We conduct a series of experiments with two mainstream video recognition models, C3D and LRCN, on the public UCF-101 and HMDB-51 datasets. The results demonstrate that the proposed method can significantly reduce the perturbation of adversarial examples, and that attacking sparse video frames can be more effective than attacking every frame.
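
The query-and-feedback loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical black box (`toy_threat_model`) that returns only the true-class probability, a fixed per-frame shift in place of an optimized perturbation, and a plain REINFORCE update on independent per-frame selection probabilities. All function names, the reward shape (probability drop minus a sparsity penalty `lam`), and every constant are assumptions made for illustration.

```python
import math
import random


def toy_threat_model(video, weights):
    """Hypothetical black box: returns only the true-class probability."""
    score = sum(w * sum(f) / len(f) for w, f in zip(weights, video))
    return 1.0 / (1.0 + math.exp(-score))


def perturb(video, mask, eps):
    """Toy attack: shift every pixel of each selected frame by -eps."""
    return [[x - eps for x in f] if m else list(f) for f, m in zip(video, mask)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_selector(video, weights, steps=300, eps=0.5, lam=0.05, lr=0.5, seed=0):
    """Learn per-frame selection probabilities from query feedback alone."""
    rng = random.Random(seed)
    logits = [0.0] * len(video)          # one selection logit per frame
    p_clean = toy_threat_model(video, weights)
    baseline = 0.0                       # running mean reward, reduces variance
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        # Action: sample which frames to attack, then query the black box.
        mask = [1 if rng.random() < p else 0 for p in probs]
        p_adv = toy_threat_model(perturb(video, mask, eps), weights)
        # Reward: drop in true-class probability minus a sparsity penalty.
        reward = (p_clean - p_adv) - lam * sum(mask)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # REINFORCE: for a Bernoulli policy, d log pi / d logit = mask - prob.
        logits = [z + lr * advantage * (m - p)
                  for z, m, p in zip(logits, mask, probs)]
    return [sigmoid(z) for z in logits]


# Six two-pixel "frames"; the toy model is most sensitive to frames 2 and 5.
video = [[0.5, 0.5] for _ in range(6)]
weights = [0.1, 0.1, 2.0, 0.1, 0.1, 2.0]
probs = train_selector(video, weights)
print([round(p, 2) for p in probs])
```

The penalty term is what makes sparsity pay off in this sketch: selecting a frame the model barely depends on costs `lam` but buys almost no drop in probability. A real attack would query an actual video classifier such as C3D or LRCN and would also optimize the perturbation itself rather than apply a fixed shift.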
