Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM

Abstract

Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors: trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model's parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs whose text is structurally similar to malicious queries, which guide the model to rely on the visual content, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis of hidden-layer representations reveals that security tensors successfully activate the language module's textual "safety layers" for visual inputs, thereby effectively extending text-based safety to the visual modality.
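The core idea, optimizing an input-side tensor while the model's own parameters stay fixed, can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: the "model" here is a frozen two-weight logistic scorer rather than an LVLM, the tensor is assumed to be added to the input features, and the names `W`, `B`, `DATA`, `refusal_prob`, and `train_security_tensor` are hypothetical. The three toy samples mirror the three dataset groups described above.

```python
import math

# Frozen toy "model": refusal probability = sigmoid(W . x + B).
# W and B stand in for the LVLM's parameters and are never updated.
W = (1.0, -1.0)
B = -3.0

def refusal_prob(x, delta):
    """Refusal probability for input x with the trainable tensor delta added."""
    z = sum(w * (xi + di) for w, xi, di in zip(W, x, delta)) + B
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: (input features, target); 1 = refuse, 0 = answer.
DATA = [
    ((1.0, -1.0), 1),   # malicious image-text pair
    ((-1.0, 1.0), 0),   # contrastive benign pair
    ((-1.5, 0.5), 0),   # general benign sample
]

def loss(delta):
    """Mean binary cross-entropy over the toy dataset."""
    total = 0.0
    for x, y in DATA:
        p = refusal_prob(x, delta)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(DATA)

def train_security_tensor(steps=500, lr=0.5):
    """Gradient descent on delta only; W and B are left untouched."""
    delta = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in DATA:
            err = refusal_prob(x, delta) - y  # d(BCE)/d(logit)
            for j, w in enumerate(W):
                grad[j] += err * w / len(DATA)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta

delta = train_security_tensor()
```

In this toy setting the frozen scorer already separates the malicious sample from the benign ones but never crosses the refusal threshold; the learned `delta` shifts the inputs so that it does, while the benign samples stay below the threshold. That loosely parallels the abstract's claim that security tensors activate safety behavior the language module already has.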
