Abstract
We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
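To make the flip operation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the gradient of the loss with respect to each one-hot input entry has already been computed, and it scores a flip by a first-order estimate: the gradient at the candidate token minus the gradient at the current token. The function name `best_flip` and the gradient values are hypothetical, chosen only for illustration.

```python
def best_flip(grad, tokens):
    """Return (position, new_token, estimated_loss_increase) for the best single flip.

    grad[i][v] is the loss gradient w.r.t. the one-hot entry for vocabulary
    item v at position i; tokens[i] is the index currently at position i.
    """
    best = None
    for i, current in enumerate(tokens):
        for v, g in enumerate(grad[i]):
            if v == current:
                continue  # replacing a token with itself is not a flip
            # First-order estimate of the change in loss caused by the flip.
            gain = g - grad[i][current]
            if best is None or gain > best[2]:
                best = (i, v, gain)
    return best


# Hypothetical gradients for a 3-position input over a 4-symbol vocabulary.
grad = [
    [0.1, -0.2, 0.0, 0.3],
    [-0.5, 0.4, 0.1, 0.0],
    [0.2, 0.2, -0.1, 0.1],
]
tokens = [3, 0, 2]  # current token index at each position

position, new_token, gain = best_flip(grad, tokens)
print(position, new_token, round(gain, 3))
```

Because every candidate flip is scored from a single gradient computation, the search needs no additional forward passes per candidate, which is consistent with the efficiency the abstract emphasizes.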