Abstract
Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although prompt hacking is widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on it. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.