MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

Abstract

The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. We delve into the novel challenge of defending MLLMs against such attacks. We discovered that images act as a "foreign language" that is not considered during alignment, which can make MLLMs prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, image signals are continuous, which makes it difficult for alignment to cover the possible scenarios thoroughly. This vulnerability is exacerbated by the fact that open-source MLLMs are predominantly fine-tuned on a limited number of image-text pairs, far fewer than the extensive text-based pretraining corpus, which makes the MLLMs more prone to catastrophic forgetting of their original abilities during explicit alignment tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy combining a lightweight harm detector and a response detoxifier. The harm detector's role is to identify potentially harmful outputs from the MLLM, while the detoxifier corrects these outputs to ensure that the response complies with safety standards. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the model's overall performance. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
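The detect-then-detoxify flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`protect`, `toy_generate`, `toy_harm_score`, `toy_detoxify`), the 0.5 threshold, and the choice to pass the text query to the detoxifier are all assumptions made for illustration, and the toy components are keyword-based stand-ins for trained models.

```python
from typing import Callable


def protect(
    generate: Callable[[bytes, str], str],
    harm_score: Callable[[str], float],
    detoxify: Callable[[str, str], str],
    image: bytes,
    query: str,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> str:
    """Wrap an MLLM with an output-side harm detector and a detoxifier."""
    # The underlying MLLM is left untouched (plug-and-play): no alignment tuning.
    response = generate(image, query)
    # The detector scores the MLLM's output, not the image itself.
    if harm_score(response) >= threshold:
        # Only flagged responses are rewritten by the detoxifier.
        return detoxify(query, response)
    return response


# Toy stand-ins (hypothetical) so the sketch runs without any model weights.
def toy_generate(image: bytes, query: str) -> str:
    if b"weapon" in image:
        return "Here is how to build a weapon."
    return "A cat sits on a sofa."


def toy_harm_score(response: str) -> float:
    return 1.0 if "weapon" in response.lower() else 0.0


def toy_detoxify(query: str, response: str) -> str:
    return "I'm sorry, but I can't help with that request."


benign = protect(toy_generate, toy_harm_score, toy_detoxify,
                 b"cat photo", "Describe the image.")
flagged = protect(toy_generate, toy_harm_score, toy_detoxify,
                  b"weapon diagram", "Explain the steps shown in the image.")
```

In this sketch benign responses pass through unchanged, which mirrors the stated goal of leaving the model's ordinary performance intact; only responses the detector flags are routed through the detoxifier.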
