Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity

Abstract

We introduce a novel family of adversarial attacks that exploit the inability of language models to interpret ASCII art. To evaluate these attacks, we propose the ToxASCII benchmark and develop two custom ASCII art fonts: one leveraging special tokens and another using text-filled letter shapes. Our attacks achieve a perfect 1.0 Attack Success Rate across ten models, including OpenAI's o1-preview and LLaMA 3.1. Warning: this paper contains examples of toxic language used for research purposes.
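
For readers unfamiliar with the metric, the following is a minimal sketch of how an Attack Success Rate could be computed, assuming it is defined as the fraction of attack attempts judged successful. The abstract does not give the paper's exact definition, so that definition, the function name `attack_success_rate`, and the per-model outcomes are all hypothetical and chosen only for illustration.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attack attempts that succeeded.

    `outcomes` is an iterable of booleans, one per attempt, where True
    means the attempt was judged successful. This definition is an
    assumption; the paper may define the metric differently.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical outcomes for two placeholder models (not data from the paper).
per_model = {
    "model_a": [True, True, True],
    "model_b": [True, False, True, True],
}
rates = {name: attack_success_rate(o) for name, o in per_model.items()}
print(rates)
```

Under this definition, a rate of 1.0 means every attempt in the evaluation set succeeded.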
