Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

Abstract

The rapid advancement of multimodal large language models (MLLMs) has led to breakthroughs in various applications, yet their security remains a critical challenge. One pressing issue involves unsafe image-query pairs: jailbreak inputs specifically designed to bypass security constraints and elicit unintended responses from MLLMs. Compared to general multimodal data, such unsafe inputs are relatively sparse, which limits the diversity and richness of training samples available for developing robust defense models. Meanwhile, existing guardrail-type methods rely on external modules to enforce security constraints but fail to address intrinsic vulnerabilities within MLLMs. Traditional supervised fine-tuning (SFT), on the other hand, often over-refuses harmless inputs, compromising general performance. Given these challenges, we propose Secure Tug-of-War (SecTOW), an innovative iterative defense-attack training method to enhance the security of MLLMs. SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). During the iterative process, the attacker identifies security vulnerabilities in the defense model and expands jailbreak data. The expanded data are then used to train the defender, enabling it to address identified security vulnerabilities. We also design reward mechanisms used for GRPO to simplify the use of response labels, reducing dependence on complex generative labels and enabling the efficient use of synthetic data. Additionally, a quality monitoring mechanism is used to mitigate the defender's over-refusal of harmless inputs and ensure the diversity of the jailbreak data generated by the attacker. Experimental results on safety-specific and general benchmarks demonstrate that SecTOW significantly improves security while preserving general performance.
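To illustrate how a simple label-based reward can drive GRPO-style training, the following is a minimal sketch, not the paper's implementation. The abstract does not specify the reward functions, so `defender_reward` is a hypothetical binary reward (1.0 when the refusal decision matches a safe/unsafe label), and `group_relative_advantages` shows only the group-relative normalization that GRPO applies to the rewards of responses sampled for one prompt. The use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def defender_reward(refused: bool, prompt_is_unsafe: bool) -> float:
    """Hypothetical binary reward: refuse unsafe prompts, answer harmless ones.

    Only a safe/unsafe label is needed, not a reference response.
    """
    return 1.0 if refused == prompt_is_unsafe else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one unsafe prompt; the second one fails to refuse.
refusals = [True, False, True, True]
rewards = [defender_reward(r, prompt_is_unsafe=True) for r in refusals]
advantages = group_relative_advantages(rewards)
```

In this sketch the response that fails to refuse receives a negative advantage and the refusing responses receive positive ones. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.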
