Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak

Abstract

Large Language Models (LLMs) demonstrate remarkable zero-shot performance across various natural language processing tasks. The integration of multimodal encoders extends their capabilities, enabling the development of Multimodal Large Language Models that process vision, audio, and text. However, these capabilities also raise significant security concerns, as these models can be manipulated through jailbreaks to generate harmful or inappropriate content. While extensive research explores the impact of modality-specific input edits on text-based LLMs and Large Vision-Language Models in jailbreaks, the effects of audio-specific edits on Large Audio-Language Models (LALMs) remain underexplored. This paper addresses this gap by investigating how audio-specific edits influence LALM inference with respect to jailbreaks. We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection, and the Edited Audio Datasets (EADs), a comprehensive audio jailbreak benchmark. We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits. This work lays the groundwork for future explorations of audio-modality interactions in LALM security.
