Robust Conversational Agents against Imperceptible Toxicity Triggers

Abstract

Warning: this paper contains content that may be offensive or upsetting. Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language in existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force a system to generate toxic language, and to defenses against them. Existing work on generating such attacks either relies on human-generated attacks, which are costly and not scalable, or, in the case of automatic attacks, produces attack vectors that do not conform to human-like language and can therefore be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while also being effective and scalable, i.e., they can automatically trigger the system into generating toxic language. We then propose a defense mechanism against such attacks that not only mitigates the attack but also attempts to maintain the conversational flow. Through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation, even against imperceptible toxicity triggers, while the generated language fits the conversation in terms of coherency and relevancy. Lastly, we establish the generalizability of such a defense mechanism on language generation models beyond conversational agents.
