Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

  • 2024-03-03 00:12:16
  • Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, Jordan Boyd-Graber

Abstract

Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.

 
