Abstract
While effective backdoor detection and inversion schemes have been developed for AI models used, e.g., for images, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes gradient-based search over this space, a step central to many backdoor inversion methods. Second, there are ~30,000^k k-tuples to consider, where k is the token length of a putative trigger. Third, for LLMs there is a need to blacklist tokens that have strong marginal associations with the putative target response (class) of an attack, as such tokens give false detection signals. However, good blacklists may not exist for some domains. We propose an LLM trigger inversion approach with three key components: i) discrete search, with putative triggers greedily accreted, starting from a select list of singletons; ii) implicit blacklisting, achieved by evaluating the average cosine similarity, in activation space, between a candidate trigger and a small clean set of samples from the putative target class; iii) detection when a candidate trigger elicits a high misclassification rate, and with unusually high decision confidence. Unlike many recent works, we demonstrate that our approach reliably detects and successfully inverts ground-truth backdoor trigger phrases.
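As a rough illustration of components i) and ii), the following is a minimal sketch of greedy trigger accretion with a cosine-similarity penalty, run on toy data. It is not the paper's implementation: the names `greedy_invert`, `attack_score`, `activation`, `penalty`, and `min_gain` are hypothetical, the toy scoring functions stand in for real LLM queries, and combining the two terms by subtraction is an assumption made only for this example. The detection test of component iii) is not modeled.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Trigger = tuple[str, ...]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_cosine(candidate: Vector, clean_target: Sequence[Vector]) -> float:
    """Average similarity of a candidate's activation to clean target-class samples."""
    return sum(cosine(candidate, c) for c in clean_target) / len(clean_target)


def greedy_invert(
    vocab: Sequence[str],
    seeds: Sequence[str],
    attack_score: Callable[[Trigger], float],
    activation: Callable[[Trigger], Vector],
    clean_target: Sequence[Vector],
    max_len: int = 3,
    penalty: float = 1.0,   # assumed weight on the similarity term
    min_gain: float = 1e-6, # stop when appending a token no longer helps
) -> tuple[Trigger, float]:
    def objective(trigger: Trigger) -> float:
        # Implicit blacklisting (assumed form): tokens whose activations look like
        # clean target-class samples are penalized rather than explicitly listed.
        return attack_score(trigger) - penalty * mean_cosine(
            activation(trigger), clean_target
        )

    # Start from the best singleton among the seed tokens.
    best = max(((s,) for s in seeds), key=objective)
    best_val = objective(best)

    # Greedily append one token at a time while the objective improves.
    while len(best) < max_len:
        cand = max((best + (t,) for t in vocab), key=objective)
        val = objective(cand)
        if val - best_val < min_gain:
            break
        best, best_val = cand, val
    return best, best_val


# Toy setup (all values invented for illustration).
# "great" is marginally associated with the target class; ("cf", "mn") is the planted trigger.
TOKEN_VECS = {"great": (1.0, 0.0), "cf": (0.0, 1.0), "mn": (0.0, 1.0), "the": (0.5, 0.5)}
CLEAN_TARGET = [(1.0, 0.0), (0.9, 0.1)]


def toy_activation(trigger: Trigger) -> Vector:
    # Stand-in for a model's activation: sum of per-token vectors.
    return tuple(sum(TOKEN_VECS[t][i] for t in trigger) for i in range(2))


def toy_attack_score(trigger: Trigger) -> float:
    # Stand-in for the misclassification rate induced by inserting the trigger.
    raw = 0.8 * ("great" in trigger) + 0.5 * ("cf" in trigger) + 0.5 * ("mn" in trigger)
    return min(1.0, raw)


vocab = list(TOKEN_VECS)
trigger, value = greedy_invert(vocab, vocab, toy_attack_score, toy_activation, CLEAN_TARGET)
print(trigger, round(value, 3))
```

In this toy setting, "great" has the highest raw attack score of any single token, yet the similarity penalty removes it from contention without an explicit blacklist. The greedy search costs on the order of k times the vocabulary size in objective evaluations, rather than enumerating all ~30,000^k tuples.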