HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States

Abstract

The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
