SecAlign: Defending Against Prompt Injection with Preference Optimization

Abstract

Large language models (LLMs) are becoming increasingly prevalent in modernsoftware systems, interfacing between the user and the Internet to assist withtasks that require advanced language understanding. To accomplish these tasks,the LLM often uses external data sources such as user documents, web retrieval,results from API calls, etc. This opens up new avenues for attackers tomanipulate the LLM via prompt injection. Adversarial prompts can be injectedinto external data sources to override the system's intended instruction andinstead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlignbased on the technique of preference optimization. Our defense first constructsa preference dataset with prompt-injected inputs, secure outputs (ones thatrespond to the legitimate instruction), and insecure outputs (ones that respondto the injection). We then perform preference optimization on this dataset toteach the LLM to prefer the secure output over the insecure one. This providesthe first known method that reduces the success rates of various promptinjections to around 0%, even against attacks much more sophisticated than onesseen during training. This indicates our defense generalizes well againstunknown and yet-to-come attacks. Also, our defended models are still practicalwith similar utility to the one before our defensive training. Our code is athttps://github.com/facebookresearch/SecAlign

Quick Read (beta)

loading the full paper ...