Generating Label Cohesive and Well-Formed Adversarial Claims

Abstract

Adversarial attacks reveal important vulnerabilities and flaws of trained models. One potent type of attack uses universal adversarial triggers, which are individual n-grams that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of the instances they are inserted in. In addition, such attacks produce semantically nonsensical inputs, as they simply concatenate triggers to existing samples. Here, we investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning and are semantically valid. We extend the HotFlip attack algorithm used for universal trigger generation by jointly minimising the target class loss of a fact checking model and the entailment class loss of an auxiliary natural language inference model. We then train a conditional language model to generate semantically valid statements that include the found universal triggers. We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
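To make the joint objective more concrete, the following is a minimal sketch of a HotFlip-style candidate selection step, not the implementation used in this work. It assumes the gradients of the two losses with respect to one trigger token embedding have already been computed. The function names, the `nli_weight` parameter, and the weighted-sum combination of the two gradients are illustrative assumptions that the abstract does not specify.

```python
from typing import List, Sequence


def joint_gradient(
    grad_fc: Sequence[float],
    grad_nli: Sequence[float],
    nli_weight: float = 1.0,  # hypothetical weighting, not given in the abstract
) -> List[float]:
    """Combine the gradient of the fact checking target class loss with the
    gradient of the NLI entailment class loss (assumed here: a weighted sum)."""
    return [g_fc + nli_weight * g_nli for g_fc, g_nli in zip(grad_fc, grad_nli)]


def hotflip_candidates(
    embeddings: Sequence[Sequence[float]],
    grad: Sequence[float],
    k: int = 1,
) -> List[int]:
    """Return the ids of the k vocabulary tokens with the lowest first-order
    estimate of the loss after swapping them into the trigger position.

    HotFlip approximates the loss change of replacing embedding e with e' as
    (e' - e) . grad. The term e . grad is the same for every candidate, so
    ranking by e' . grad is sufficient.
    """
    scores = [sum(e * g for e, g in zip(row, grad)) for row in embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]
```

In a full attack this step would presumably be repeated for each trigger position and over batches of claims, with the top candidates re-scored by the actual models; those details are outside what the abstract describes.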
