Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Abstract

Speech-enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech-enabled foundation models that needs to be considered prior to the deployment of this form of model.
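As a rough illustration of the "prepend" operation described in the abstract, the following minimal sketch joins one fixed segment to the front of several unrelated waveforms. It is not the authors' implementation: the function name `prepend_segment`, the 0.5 s segment length, the amplitude range, and the random placeholder samples are all assumptions made for illustration. In the attack itself the segment would be learned, and the abstract does not describe that training procedure. Waveforms are plain Python lists of float samples to keep the sketch dependency-free.

```python
import random

SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz audio


def prepend_segment(segment, speech):
    """Return a new waveform: the universal segment followed by the speech.

    Both arguments are mono waveforms given as lists of float samples.
    Neither input is modified.
    """
    return list(segment) + list(speech)


# Placeholder for the learned segment (assumption: 0.5 s of low-amplitude noise).
# A real attack would substitute the trained adversarial samples here.
rng = random.Random(0)
universal_segment = [rng.uniform(-0.02, 0.02) for _ in range(SAMPLE_RATE // 2)]

# "Universal" means the same segment is reused, unchanged, for every input.
utterances = [[0.0] * (SAMPLE_RATE * seconds) for seconds in (1, 2, 3)]
attacked = [prepend_segment(universal_segment, u) for u in utterances]
```

Each entry of `attacked` starts with the identical segment and leaves the original speech untouched after it, which is what allows a single segment to be applied to any input without access to the model prompt.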
