Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models

Abstract

We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms that detect low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulating content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, e.g. text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, which are applied through a beam search procedure until the victim classifier changes its decision. The evaluation confirms the superiority of our approach in the constrained scenario, especially in the case of long input texts (news articles), where exhaustive search is not feasible.
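
The pipeline outlined above (rephrase, decompose into small changes, beam search under a query limit) can be illustrated with a minimal sketch. It is not the authors' implementation: the token-level diff used for decomposition, the names `decompose`, `conflicts`, `apply_edits`, `beam_attack` and `toy_victim`, the 0.5 decision threshold, and the hard-coded rephrasings standing in for language-model output are all assumptions made for illustration.

```python
import difflib
from typing import Callable, List, Optional, Tuple

# An edit replaces original tokens[start:end] with new tokens (hypothetical format).
Edit = Tuple[int, int, Tuple[str, ...]]


def decompose(original: List[str], rephrased: List[str]) -> List[Edit]:
    """Split one rephrasing into small token-level edits via a sequence diff."""
    matcher = difflib.SequenceMatcher(a=original, b=rephrased, autojunk=False)
    return [(i1, i2, tuple(rephrased[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def conflicts(a: Edit, b: Edit) -> bool:
    """Two edits conflict if they start at the same position or their spans overlap."""
    return a[0] == b[0] or (a[0] < b[1] and b[0] < a[1])


def apply_edits(original: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-conflicting edits right to left so earlier indices stay valid."""
    tokens = list(original)
    for start, end, new in sorted(edits, key=lambda e: e[0], reverse=True):
        tokens[start:end] = new
    return tokens


def beam_attack(original: List[str], edits: List[Edit],
                victim_score: Callable[[str], float],
                beam_width: int = 2, max_queries: int = 20
                ) -> Tuple[Optional[str], int]:
    """Add one edit at a time, keeping the beam_width states with the lowest
    victim score, until the score drops below 0.5 or the budget runs out."""
    queries = 0
    beam = [frozenset()]          # each state is a set of indices into `edits`
    seen = {frozenset()}
    while beam:
        scored = []
        for state in beam:
            for k in range(len(edits)):
                if k in state or any(conflicts(edits[k], edits[m]) for m in state):
                    continue
                new_state = state | {k}
                if new_state in seen:
                    continue
                seen.add(new_state)
                if queries >= max_queries:
                    return None, queries      # query budget exhausted
                text = " ".join(apply_edits(original, [edits[m] for m in new_state]))
                score = victim_score(text)    # one query to the victim
                queries += 1
                if score < 0.5:               # victim changed its decision
                    return text, queries
                scored.append((score, new_state))
        scored.sort(key=lambda item: item[0])
        beam = [state for _, state in scored[:beam_width]]
    return None, queries


def toy_victim(text: str) -> float:
    """Stand-in classifier: the score grows with the number of trigger words."""
    hits = sum(word in text.split() for word in ("shocking", "hoax"))
    return (0.1, 0.6, 0.9)[hits]


original = "shocking report exposes hoax today".split()
# Hard-coded rephrasings stand in for language-model output.
rephrasings = ["surprising report exposes hoax today",
               "shocking report exposes claim today"]
edits = [e for r in rephrasings for e in decompose(original, r.split())]
adversarial, used = beam_attack(original, edits, toy_victim)
```

In this toy setting neither rephrasing flips the decision alone, but combining one small change from each does, which is the motivation for decomposing rephrasings before searching. The `max_queries` argument models the limited number of attempts available to an attacker.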
