Can Editing LLMs Inject Harm?

  • 2024-07-31 18:57:20
  • Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu

Abstract

Knowledge editing techniques have been increasingly adopted to efficiently correct the false or outdated knowledge in Large Language Models (LLMs), due to the high cost of retraining from scratch. Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. Then, we find that editing attacks can inject both types of misinformation into LLMs, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can cause a bias increase in general outputs of LLMs, which are even highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. Then, we further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the hardness of defending editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs.

 
