Abstract
Large language models (LLMs) have been shown to be vulnerable to jailbreaking attacks, in which adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to proactively detect and filter harmful queries emerging from evolving contexts. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments across multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering, and lightweight LLM guardrail baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness, and over-refusal. The project website is at https://sites.google.com/view/llm-nbf/home and our code is available at https://github.com/HanjiangHu/NBF-LLM.
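
As a rough illustration of the turn-by-turn filtering idea summarized above, the following toy sketch keeps a dialogue state inside a safe set by rejecting any query whose predicted next state would make a barrier value negative. It is not the authors' implementation: `toy_dynamics`, `toy_barrier`, and `steer` are hypothetical names, the two-dimensional vectors stand in for learned embeddings, and the hand-written barrier stands in for the learned NBF.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_dynamics(state: Sequence[float], query: Sequence[float]) -> Vector:
    """Hypothetical state update: average the prior context with the new query.

    Assumption: a learned state-space model would replace this function.
    """
    return [0.5 * s + 0.5 * q for s, q in zip(state, query)]


def toy_barrier(state: Sequence[float]) -> float:
    """Hypothetical barrier: non-negative inside the unit ball (the toy safe set).

    Assumption: a trained neural barrier function would replace this function.
    """
    return 1.0 - sum(s * s for s in state)


def steer(
    state: Sequence[float],
    queries: Sequence[Sequence[float]],
    dynamics: Callable[[Sequence[float], Sequence[float]], Vector],
    barrier: Callable[[Sequence[float]], float],
) -> Tuple[Vector, List[bool]]:
    """Accept a query only if the predicted next state stays in the safe set."""
    current = list(state)
    decisions: List[bool] = []
    for query in queries:
        candidate = dynamics(current, query)
        if barrier(candidate) >= 0.0:
            current = candidate      # safe: the dialogue state advances
            decisions.append(True)
        else:
            decisions.append(False)  # filtered: the state is left unchanged
    return current, decisions


# Three turns that drift further from the origin; values are made up.
turns = [[0.4, 0.0], [1.0, 0.0], [3.0, 0.0]]
final_state, decisions = steer([0.0, 0.0], turns, toy_dynamics, toy_barrier)
print(decisions)  # [True, True, False]
```

Because a query is applied only when the barrier stays non-negative at the predicted next state, a state that starts in the safe set remains there at every turn, which is the invariance property the abstract refers to. In this toy run the third query is rejected even though the first two, which drift in the same direction, are accepted.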