ASIDE: Architectural Separation of Instructions and Data in Language Models

Abstract

Despite their remarkable performance, large language models lack elementary safety features, and this makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause for the success of prompt injection attacks. In this work, we propose a method, ASIDE, that allows the model to clearly separate instructions from data at the level of embeddings. ASIDE applies a fixed orthogonal rotation to the embeddings of data tokens, thus creating distinct representations of instruction and data tokens without introducing any additional parameters. We demonstrate the effectiveness of our method by instruct-tuning LLMs with ASIDE and showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.
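
To make the embedding-level idea concrete, the following is a minimal sketch of applying a fixed orthogonal rotation to data-token embeddings only. It is an illustration under stated assumptions, not the authors' implementation: the abstract says only that the rotation is fixed and orthogonal, so the particular choice here (a 90-degree rotation over adjacent coordinate pairs) is an assumption, as are the names `rotation_matrix` and `embed`, the toy embedding table, and the `is_data` mask.

```python
import numpy as np


def rotation_matrix(dim: int) -> np.ndarray:
    """Build a fixed orthogonal matrix with no learned parameters.

    Assumption: a block-diagonal 90-degree rotation over adjacent
    coordinate pairs. The abstract only requires "fixed" and "orthogonal".
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even for pairwise rotation")
    block = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(np.eye(dim // 2), block)


def embed(token_ids, is_data, table, rotation):
    """Look up embeddings and rotate only the positions marked as data."""
    vectors = table[token_ids]                  # shape (seq_len, dim)
    rotated = vectors @ rotation.T              # rotate every row
    mask = np.asarray(is_data, dtype=bool)[:, None]
    return np.where(mask, rotated, vectors)     # keep instruction rows as-is


# Toy setup (hypothetical sizes): vocabulary of 10 tokens, embedding dim 8.
rng = np.random.default_rng(0)
table = rng.normal(size=(10, 8))
R = rotation_matrix(8)

# The same token id (3) appears once as an instruction and once as data.
token_ids = np.array([3, 3, 7])
is_data = np.array([False, True, True])
out = embed(token_ids, is_data, table, R)
```

In this sketch the same token receives two different vectors depending on its role, while the embedding table itself is shared, so no parameters are added. With the 90-degree choice assumed here, the instruction and data versions of a token are also orthogonal to each other and have equal norm.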
