Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Abstract

Conventional language model (LM) safety alignment relies on a reactive,disjoint procedure: attackers exploit a static model, followed by defensivefine-tuning to patch exposed vulnerabilities. This sequential approach createsa mismatch -- attackers overfit to obsolete defenses, while defendersperpetually lag behind emerging threats. To address this, we proposeSelf-RedTeam, an online self-play reinforcement learning algorithm where anattacker and defender agent co-evolve through continuous interaction. We castsafety alignment as a two-player zero-sum game, where a single model alternatesbetween attacker and defender roles -- generating adversarial prompts andsafeguarding against them -- while a reward LM adjudicates outcomes. Thisenables dynamic co-adaptation. Grounded in the game-theoretic framework ofzero-sum games, we establish a theoretical safety guarantee which motivates thedesign of our method: if self-play converges to a Nash Equilibrium, thedefender will reliably produce safe responses to any adversarial input.Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) comparedto attackers trained against static defenders and achieves higher robustness onsafety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trainedagainst static attackers. We further propose hidden Chain-of-Thought, allowingagents to plan privately, which boosts adversarial diversity and reducesover-refusals. Our results motivate a shift from reactive patching to proactiveco-evolution in LM safety training, enabling scalable, autonomous, and robustself-improvement of LMs via multi-agent reinforcement learning (MARL).

Quick Read (beta)

loading the full paper ...