Abstract
Large Language Models (LLMs) are vulnerable to backdoor attacks, where hidden triggers can maliciously manipulate model behavior. While several backdoor attack methods have been proposed, the mechanisms by which backdoor functions operate in LLMs remain underexplored. In this paper, we move beyond attacking LLMs and investigate backdoor functionality through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-understandable explanations for their decisions, allowing us to compare explanations for clean and poisoned samples. We explore various backdoor attacks and embed the backdoor into LLaMA models for multiple tasks. Our experiments show that backdoored models produce higher-quality explanations for clean data than for poisoned data, while generating significantly more consistent explanations for poisoned data than for clean data. We further analyze the explanation generation process, revealing that at the token level, the explanation token of poisoned samples emerges only in the final few transformer layers of the LLM. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the input context when generating the explanation. These findings deepen our understanding of backdoor attack mechanisms in LLMs and offer a framework for detecting such vulnerabilities through explainability techniques, contributing to the development of more secure LLMs.
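To make the notion of explanation consistency concrete, the following is a minimal sketch of one way it could be quantified: the mean pairwise token-set Jaccard similarity over several explanations generated for the same input. The abstract does not specify which consistency metric is used, so the choice of Jaccard similarity, the function names `jaccard` and `mean_pairwise_similarity`, and the example strings are all illustrative assumptions and are not drawn from the experiments.

```python
from itertools import combinations
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two explanation strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty explanations are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(explanations: Sequence[str]) -> float:
    """Average Jaccard similarity over all unordered pairs of explanations."""
    pairs = list(combinations(explanations, 2))
    if not pairs:
        raise ValueError("need at least two explanations")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical explanations sampled several times for one input each.
# These strings are made up for illustration only.
clean_explanations = [
    "the review praises the acting",
    "positive words like great suggest approval",
    "the plot is described as engaging",
]
poisoned_explanations = [
    "the text is positive",
    "the text is positive",
    "the text is positive overall",
]

clean_score = mean_pairwise_similarity(clean_explanations)
poisoned_score = mean_pairwise_similarity(poisoned_explanations)
print(f"clean consistency:    {clean_score:.3f}")
print(f"poisoned consistency: {poisoned_score:.3f}")
```

Under this assumed metric, a higher score for the poisoned set than for the clean set would correspond to the pattern described above, in which explanations for poisoned data are more consistent. An embedding-based similarity could be substituted for `jaccard` without changing the surrounding logic.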