PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement Learning

Abstract

While real-world applications of reinforcement learning are becoming popular, the security and robustness of RL systems are worthy of more attention and exploration. In particular, recent works have revealed that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. Trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. To ensure the security of RL agents against malicious backdoors, in this work, we propose the problem of Backdoor Detection in a multi-agent competitive reinforcement learning system, with the objective of detecting Trojan agents as well as the corresponding potential trigger actions, and further trying to mitigate their Trojan behavior. In order to solve this problem, we propose PolicyCleanse, which is based on the property that an activated Trojan agent's accumulated rewards degrade noticeably after several timesteps. Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor. Extensive experiments demonstrate that the proposed methods can accurately detect Trojan agents, and outperform existing backdoor mitigation baseline approaches by at least 3% in winning rate across various types of agents and environments.
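
The property the abstract relies on, that an activated Trojan agent's accumulated reward drops noticeably after several timesteps, can be illustrated with a small check. This is only a sketch of that one property and not the PolicyCleanse algorithm itself: the function names, the `min_gap` threshold, and the idea of comparing against a trigger-free baseline rollout are all assumptions made for illustration.

```python
from itertools import accumulate
from typing import List, Sequence


def accumulated(rewards: Sequence[float]) -> List[float]:
    """Running sum of per-timestep rewards (hypothetical helper)."""
    return list(accumulate(rewards))


def looks_triggered(
    baseline_rewards: Sequence[float],
    candidate_rewards: Sequence[float],
    min_gap: float = 1.0,  # assumed threshold, not taken from the paper
) -> bool:
    """Flag a rollout whose accumulated reward falls at least `min_gap`
    below a trigger-free baseline by the end of the shared horizon."""
    base = accumulated(baseline_rewards)
    cand = accumulated(candidate_rewards)
    horizon = min(len(base), len(cand))
    if horizon == 0:
        # Nothing to compare, so nothing to flag.
        return False
    return base[horizon - 1] - cand[horizon - 1] >= min_gap
```

A caller would collect per-timestep rewards of the agent under test once without any candidate trigger and once while an opponent plays a candidate action sequence, then pass both lists to `looks_triggered`. How candidate sequences are found and how the threshold is set are left open here.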
