A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures

Abstract

Large Language Models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LLMs, constraints on computational resources mean that users have to rely on open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to security vulnerabilities, particularly backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth examination of backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the latest trends in the field, this paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods. Specifically, we systematically classify backdoor attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning, and no fine-tuning. Based on insights from a substantial review, we also discuss crucial issues for future research on backdoor attacks, such as further exploring attack algorithms that do not require fine-tuning, or developing more covert attack algorithms.
