Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity

  • 2024-09-30 18:18:29
  • Sergey Berezin, Reza Farahbakhsh, Noel Crespi

Abstract

We introduce a novel family of adversarial attacks that exploit the inability of language models to interpret ASCII art. To evaluate these attacks, we propose the ToxASCII benchmark and develop two custom ASCII art fonts: one leveraging special tokens and another using text-filled letter shapes. Our attacks achieve a perfect 1.0 Attack Success Rate across ten models, including OpenAI's o1-preview and LLaMA 3.1. Warning: this paper contains examples of toxic language used for research purposes.
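
To make the reported metric concrete, the following minimal sketch computes an Attack Success Rate under the common assumption that it is the fraction of attack attempts judged successful. The abstract does not give the exact definition used in the paper, so the function name `attack_success_rate`, the boolean encoding of outcomes, and the sample data are all hypothetical illustrations.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of attack attempts marked as successful.

    Assumption: each entry is True when the attack evaded detection
    for one input, and False otherwise.
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    # True counts as 1 and False as 0, so sum() counts the successes.
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes, not data from the paper.
every_attempt_succeeds = [True] * 10
half_succeed = [True, False, False, True]

print(attack_success_rate(every_attempt_succeeds))
print(attack_success_rate(half_succeed))
```

Under this reading, a score of 1.0 means that every attempted attack in the evaluation set succeeded against the model in question.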

 
