Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Abstract

The safety alignment of Vision-Language Models (VLMs) is prone to degradation relative to their LLM backbones once a vision module is integrated. We investigate this phenomenon, which we call "safety alignment degradation", and show that the challenge arises from the representation gap that emerges when the vision modality is introduced into VLMs. In particular, we show that the representations of multi-modal inputs shift away from those of text-only inputs, which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference-time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs while preserving the functional capabilities of VLMs. Empirical results show that our framework significantly recovers the alignment ability inherited from the LLM backbone, with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs, even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention. WARNING: This paper contains examples of toxic or harmful language.
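
The abstract does not give the exact form of the CMRM intervention, so the following is only a toy sketch of the general idea it describes: multi-modal representations drift away from the text-only distribution, and an inference-time correction pulls them back. It assumes the drift can be summarized as a difference of mean hidden states. The function names, the `alpha` strength parameter, and the 2-dimensional vectors are all hypothetical and chosen for illustration; this should not be read as the paper's actual procedure.

```python
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def estimate_shift(multimodal_states, text_only_states):
    """Assumed shift estimate: mean(multi-modal) minus mean(text-only)."""
    mm_mean = mean_vector(multimodal_states)
    txt_mean = mean_vector(text_only_states)
    return [m - t for m, t in zip(mm_mean, txt_mean)]


def pull_back(hidden_state, shift, alpha=1.0):
    """Move one hidden state back toward the text-only region.

    alpha is a hypothetical strength knob: 0.0 leaves the state unchanged,
    1.0 subtracts the full estimated shift.
    """
    return [h - alpha * s for h, s in zip(hidden_state, shift)]


# Toy 2-d "hidden states" (made-up numbers, not model activations).
text_only = [[0.0, 1.0], [2.0, 3.0]]
multimodal = [[4.0, 1.0], [6.0, 3.0]]

shift = estimate_shift(multimodal, text_only)
corrected = pull_back(multimodal[0], shift, alpha=1.0)
```

In this toy setup the multi-modal states differ from the text-only ones only along the first coordinate, so subtracting the estimated shift maps the first multi-modal state onto its text-only counterpart. With real model activations the correction would be approximate, and the strength would presumably need tuning to avoid harming fluency.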
