Abstract
We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model's prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA 3.2. Our findings highlight critical weaknesses in current LLM safety alignments and underscore the urgent need for more sophisticated defence strategies. Warning: this paper contains examples of unethical inquiries used solely for research purposes.