RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Abstract

Latent-space monitors aim to detect undesirable behaviours in large language models by leveraging internal model representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions, but a critical open question remains: can LLMs learn to evade such monitors? To study this, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to bypass latent-space monitors while maintaining coherent generations. We apply RL-Obfuscation to LLMs ranging from 7B to 14B parameters and evaluate evasion success against a suite of monitors. We find that token-level latent-space monitors are highly vulnerable to this attack. More holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, we show that adversarial policies trained to evade a single static monitor generalise to unseen monitors of the same type. Finally, we study how the policy learned by RL bypasses these monitors and find that the model can also learn to repurpose tokens to mean something different internally.
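
The contrast between token-level and max-pooling monitors can be pictured with a toy example. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it assumes a linear probe with a sigmoid readout, assumes the token-level monitor aggregates per-token scores by their mean, and uses made-up two-dimensional activations. The names `token_scores`, `mean_monitor`, and `max_monitor` are hypothetical.

```python
import numpy as np

def token_scores(activations, w, b):
    """Per-token probe scores: sigmoid of a linear readout on each token's activation."""
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

def mean_monitor(scores):
    """Assumed token-level aggregation: average the per-token scores."""
    return float(np.mean(scores))

def max_monitor(scores):
    """Max-pooling aggregation: the single highest per-token score."""
    return float(np.max(scores))

# Hypothetical probe direction and bias (assumptions for illustration only).
w = np.array([1.0, 0.0])
b = 0.0

# Toy sequence of 10 tokens: nine look benign to the probe, one does not.
activations = np.tile(np.array([-3.0, 0.0]), (10, 1))
activations[4] = np.array([3.0, 0.0])

scores = token_scores(activations, w, b)
threshold = 0.5  # assumed decision threshold

mean_flagged = mean_monitor(scores) > threshold
max_flagged = max_monitor(scores) > threshold
print("mean-aggregated monitor flags sequence:", mean_flagged)
print("max-pooled monitor flags sequence:", max_flagged)
```

In this toy setting the averaged score is diluted by the many low-scoring tokens, while the max-pooled score is set by the single high-scoring token. This gives one intuition for why a policy might find averaged token-level scores easier to push down than a max-pooled score, though it is not evidence for the paper's findings.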
