UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

Abstract

Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.
