Abstract
We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, e.g. text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure until the victim classifier changes its decision. The evaluation confirms the superiority of our approach in the constrained scenario, especially in the case of long input texts (news articles), where exhaustive search is not feasible.