Automating Steering for Safe Multimodal Large Language Models

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but it has also raised new safety concerns, particularly when models face adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
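
The interplay between the safety prober and the Refusal Head can be illustrated with a minimal sketch. It is not the authors' implementation: the linear logistic prober, the fixed threshold, and the names `probe_toxicity` and `steer` are all assumptions made for illustration, and the layer supplying the hidden state is assumed to have been chosen beforehand (for example, by a score such as SAS).

```python
import math
from typing import Sequence, Tuple


def probe_toxicity(hidden: Sequence[float],
                   weights: Sequence[float],
                   bias: float) -> float:
    """Hypothetical linear prober: maps one layer's hidden state to a
    probability in [0, 1] that the eventual output is toxic."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def steer(hidden: Sequence[float],
          weights: Sequence[float],
          bias: float,
          threshold: float = 0.5) -> Tuple[str, float]:
    """Hypothetical gate: intervene (here, simply refuse) only when the
    probed risk reaches the threshold; otherwise generate as usual."""
    risk = probe_toxicity(hidden, weights, bias)
    action = "refuse" if risk >= threshold else "generate"
    return action, risk


# Toy usage with made-up numbers (not taken from the paper).
action, risk = steer([1.0, -1.0], [2.0, 0.5], bias=-3.0)
print(action, round(risk, 3))
```

In this sketch the base model is never modified; only the gate decides whether generation proceeds, which mirrors the inference-time, no-fine-tuning setting described above.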
