Universal Adversarial Audio Perturbations

  • 2019-08-12 15:52:28
  • Sajjad Abdoli, Luiz G. Hafemann, Jerome Rony, Ismail Ben Ayed, Patrick Cardinal, Alessandro L. Koerich

Abstract

We demonstrate the existence of universal adversarial perturbations, which can fool a family of audio processing architectures, for both targeted and untargeted attacks. To the best of our knowledge, this is the first study on generating universal adversarial perturbations for audio processing systems. We propose two methods for finding such perturbations. The first method is based on an iterative, greedy approach that is well-known in computer vision: it aggregates small perturbations to the input so as to push it to the decision boundary. The second method, which is the main technical contribution of this work, is a novel penalty formulation, which finds targeted and untargeted universal adversarial perturbations. Differently from the greedy approach, the penalty method minimizes an appropriate objective function on a batch of samples. Therefore, it produces more successful attacks when the number of training samples is limited. Moreover, we provide a proof that the proposed penalty method theoretically converges to a solution that corresponds to universal adversarial perturbations. We report comprehensive experiments, showing attack success rates higher than 91.1% and 74.7% for targeted and untargeted attacks, respectively.

 


Universal adversarial audio perturbations
Supplementary material


Submitted to 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). Do not distribute.

1 Proof of Theorem 1

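Theorem 1: Let $\{\mathbf{v}^k\}$, $k=1,\dots,\infty$, be the sequence generated by the proposed penalty method. Let $\bar{\mathbf{v}}$ be the limit point of $\{\mathbf{v}^k\}$. Then any limit point of the sequence is a solution to the original optimization problem:

\[
\min\ \mathrm{dB}(\mathbf{v}) \quad \text{s.t.} \quad y_t = \arg\max_{y}\, \mathbb{P}(y \mid \mathbf{x}_i + \mathbf{v}, \theta) \ \text{ and } \ 0 \le \mathbf{x}_i + \mathbf{v} \le 1 \quad \forall \mathbf{x}_i. \tag{1}
\]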

Before proving Theorem 1, we present and prove a useful lemma.

Lemma 1: Let $\mathbf{v}^*$ be the optimal solution of the original constrained problem defined in Eq. (1). Then $\mathrm{dB}(\mathbf{v}^*) \ge L(\mathbf{w}_i^k, \mathbf{v}^k; t) \ge \mathrm{dB}(\mathbf{v}^k)\ \ \forall k$.

Proof of Lemma 1:

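\[
\begin{aligned}
\mathrm{dB}(\mathbf{v}^*) &= \mathrm{dB}(\mathbf{v}^*) + c \cdot G(\mathbf{w}_i^*) && (\because\ G(\mathbf{w}_i^*) = 0) \\
&\ge \mathrm{dB}(\mathbf{v}^k) + c \cdot G(\mathbf{w}_i^k) && (\because\ c > 0,\ G(\mathbf{w}_i^k) \ge 0,\ \mathbf{w}_i^k \text{ minimizes } L(\mathbf{w}_i^k, \mathbf{v}^k; t)) \\
&\ge \mathrm{dB}(\mathbf{v}^k) \\
\therefore\ \mathrm{dB}(\mathbf{v}^*) &\ge L(\mathbf{w}_i^k, \mathbf{v}^k; t) \ge \mathrm{dB}(\mathbf{v}^k) \quad \forall k
\end{aligned}
\]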

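Proof of Theorem 1. $\mathrm{dB}$ is a continuous, monotonically increasing function. $G$ is a hinge function, which is also continuous. $L$ is the sum of two continuous functions and is therefore continuous as well. The limit point of $\{\mathbf{v}^k\}$ is defined as $\bar{\mathbf{v}} = \lim_{k\to\infty} \mathbf{v}^k$, and since $\mathrm{dB}$ is continuous, $\mathrm{dB}(\bar{\mathbf{v}}) = \lim_{k\to\infty} \mathrm{dB}(\mathbf{v}^k)$. We can conclude that:

\[
\begin{aligned}
L^* &= \lim_{k\to\infty} L(\mathbf{w}_i^k, \mathbf{v}^k; t) \le \mathrm{dB}(\mathbf{v}^*) \qquad (\because\ \text{Lemma 1}) \\
\Rightarrow\ L^* &= \lim_{k\to\infty} \mathrm{dB}(\mathbf{v}^k) + \lim_{k\to\infty} c \cdot G(\mathbf{w}_i^k) \le \mathrm{dB}(\mathbf{v}^*) \\
\Rightarrow\ L^* &= \mathrm{dB}(\bar{\mathbf{v}}) + \lim_{k\to\infty} c \cdot G(\mathbf{w}_i^k) \le \mathrm{dB}(\mathbf{v}^*)
\end{aligned}
\]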

If $\mathbf{v}^k$ is a feasible point for the constrained optimization problem defined in Eq. (1), then, from the definition of the function $G(\cdot)$, one can conclude that $\lim_{k\to\infty} c \cdot G(\mathbf{w}_i^k) = 0$. Then:

\[
L^* = \mathrm{dB}(\bar{\mathbf{v}}) \le \mathrm{dB}(\mathbf{v}^*)
\]

$\therefore\ \bar{\mathbf{v}}$ is a solution of the problem defined in Eq. (1).
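The sandwich inequality of Lemma 1 can be observed numerically on a toy one-dimensional problem. The sketch below is not the audio objective of this work. It assumes a stand-in cost `f(v) = v**2` in place of dB, a hinge penalty `G(v) = max(0, 1 - v)` for the toy constraint `v >= 1`, and a grid search in place of gradient-based minimization. All names are hypothetical.

```python
import numpy as np

def f(v):
    """Stand-in for dB(v): continuous and increasing for v >= 0 (assumption)."""
    return v ** 2

def G(v):
    """Hinge penalty: zero exactly when the toy constraint v >= 1 holds."""
    return np.maximum(0.0, 1.0 - v)

grid = np.linspace(0.0, 2.0, 2001)   # candidate perturbation values
v_star = 1.0                         # constrained optimum of min f(v) s.t. v >= 1

history = []                         # (c, v_k, L_k) for each penalty coefficient
for c in [0.5, 1.0, 1.5, 2.0, 4.0]:  # increasing penalty coefficient c > 0
    L = f(grid) + c * G(grid)        # penalized objective evaluated on the grid
    i = int(np.argmin(L))            # minimizer of the penalized objective
    history.append((c, float(grid[i]), float(L[i])))

for c, v_k, L_k in history:
    # Lemma 1 on the toy problem: f(v_k) <= L_k <= f(v*)
    print(f"c={c:.1f}  f(v_k)={f(v_k):.4f}  L_k={L_k:.4f}  f(v*)={f(v_star):.4f}")
```

In this toy setting the penalized minimizer becomes feasible once `c` is large enough, at which point the penalty term vanishes and the three quantities coincide, mirroring the last step of the proof.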

2 Target models

Five types of models are targeted in this study. All models are trained with the categorical cross-entropy loss, and Adadelta (Zeiler, 2012) is used to optimize their parameters. This section gives a complete description of each model.

2.1 1D CNN Rand

Table 1 shows the configuration of 1D CNN Rand (Abdoli et al., 2019). This model consists of five one-dimensional convolutional layers with 16, 32, 64, 128 and 256 kernels, whose kernel sizes are 64, 32, 16, 8 and 4, respectively. The first, second and fifth convolutional layers are each followed by a one-dimensional max-pooling layer of size 8, 8 and 4, respectively. The output of the last pooling layer is the input to two fully connected (FC) layers with 128 and 64 neurons, and drop-out with probability 0.5 is applied to both (Srivastava et al., 2014). ReLU is the activation function for all layers. To reduce over-fitting, batch normalization is applied after the activation function of each convolutional layer (Ioffe and Szegedy, 2015). The output of the last FC layer is the input to a softmax layer with ten neurons for classification.

| Layer | Ksize | Stride | # of filters | Data shape |
|---|---|---|---|---|
| InputLayer | - | - | - | (50999, 1) |
| Conv1D | 64 | 2 | 16 | (25468, 16) |
| MaxPooling1D | 8 | 8 | 16 | (3183, 16) |
| Conv1D | 32 | 2 | 32 | (1576, 32) |
| MaxPooling1D | 8 | 8 | 32 | (197, 32) |
| Conv1D | 16 | 2 | 64 | (91, 64) |
| Conv1D | 8 | 2 | 128 | (42, 128) |
| Conv1D | 4 | 2 | 256 | (20, 256) |
| MaxPooling1D | 4 | 4 | 128 | (5, 256) |
| FC | - | - | 128 | (128) |
| FC | - | - | 64 | (64) |
| FC | - | - | 10 | (10) |

Table 1: 1D CNN Rand architecture.
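The time-axis lengths in the "Data shape" column of Table 1 can be reproduced with the usual output-length formula for unpadded layers, floor((n - ksize) / stride) + 1. The sketch below assumes that no padding is used ("valid" mode), which is not stated in the text but is consistent with every row of the table. The helper name `out_len` is hypothetical.

```python
def out_len(n, ksize, stride):
    """Output length of an unpadded ('valid') 1D convolution or pooling layer."""
    return (n - ksize) // stride + 1

# (layer type, kernel size, stride) in the order listed in Table 1
layers = [
    ("Conv1D", 64, 2),
    ("MaxPooling1D", 8, 8),
    ("Conv1D", 32, 2),
    ("MaxPooling1D", 8, 8),
    ("Conv1D", 16, 2),
    ("Conv1D", 8, 2),
    ("Conv1D", 4, 2),
    ("MaxPooling1D", 4, 4),
]

n = 50999          # input length in samples (InputLayer row)
lengths = []
for name, ksize, stride in layers:
    n = out_len(n, ksize, stride)
    lengths.append(n)

print(lengths)  # [25468, 3183, 1576, 197, 91, 42, 20, 5]
```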

2.2 1D CNN Gamma

This model is similar to 1D CNN Rand, except that a gammatone filter-bank is used to initialize the filters of its first layer (Abdoli et al., 2019). Table 2 shows the configuration of this model. The filters of the gammatone filter-bank are not trained during backpropagation. Sixty-four filters decompose the input signal into appropriate frequency bands, covering the frequency range from 100 Hz to 8 kHz. Batch normalization is also applied after this layer (Ioffe and Szegedy, 2015).

| Layer | Ksize | Stride | # of filters | Data shape |
|---|---|---|---|---|
| InputLayer | - | - | - | (50999, 1) |
| Conv1D | 512 | 1 | 64 | (50488, 64) |
| MaxPooling1D | 8 | 8 | 64 | (6311, 64) |
| Conv1D | 32 | 2 | 32 | (3140, 32) |
| MaxPooling1D | 8 | 8 | 32 | (392, 32) |
| Conv1D | 16 | 2 | 64 | (189, 64) |
| Conv1D | 8 | 2 | 128 | (91, 128) |
| Conv1D | 4 | 2 | 256 | (44, 256) |
| MaxPooling1D | 4 | 4 | 128 | (11, 256) |
| FC | - | - | 128 | (128) |
| FC | - | - | 64 | (64) |
| FC | - | - | 10 | (10) |

Table 2: 1D CNN Gamma architecture.
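To make the fixed first layer more concrete, the following is a minimal sketch of a gammatone filter-bank with the dimensions given above (64 filters of length 512, covering 100 Hz to 8 kHz). Several details are assumptions that the text does not specify: a 16 kHz sampling rate, fourth-order filters, geometrically spaced centre frequencies, bandwidths derived from the Glasberg-Moore ERB formula, and peak normalization. The function name `gammatone_bank` is hypothetical.

```python
import numpy as np

def gammatone_bank(n_filters=64, length=512, fs=16000,
                   f_lo=100.0, f_hi=8000.0, order=4):
    """Return centre frequencies and a (n_filters, length) bank of impulse responses."""
    t = np.arange(length) / fs                      # time axis in seconds
    fc = np.geomspace(f_lo, f_hi, n_filters)        # assumed centre-frequency spacing
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)         # equivalent rectangular bandwidth (Hz)
    b = 1.019 * erb                                 # bandwidth parameter of each filter
    # Gammatone impulse response: t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)
    g = (t ** (order - 1)) * np.exp(-2 * np.pi * b[:, None] * t) \
        * np.cos(2 * np.pi * fc[:, None] * t)
    g /= np.max(np.abs(g), axis=1, keepdims=True)   # peak-normalize each filter
    return fc, g

fc, bank = gammatone_bank()
print(bank.shape)  # (64, 512)
```

Kernels built this way could be loaded into the first convolutional layer and then excluded from gradient updates, which corresponds to the statement that the filter-bank is not trained.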

2.3 ENVnet-V2

Table 3 shows the architecture of ENVnet-V2 (Tokozume et al., 2017). This model extracts short-time frequency features from the audio signal using two one-dimensional convolutional layers with 32 and 64 filters, respectively, followed by a one-dimensional max-pooling layer. The model then swaps axes and convolves the features in the time and frequency domains using two two-dimensional convolutional layers, each with 32 filters, followed by a two-dimensional max-pooling layer. Next come two more two-dimensional convolutional layers followed by a max-pooling layer, and then another two-dimensional convolutional layer with 128 filters. After two FC layers with 4096 neurons each, a softmax layer is applied for classification. Drop-out with probability 0.5 is applied to the FC layers (Srivastava et al., 2014), and ReLU is the activation function for all layers.

| Layer | Ksize | Stride | # of filters | Data shape |
|---|---|---|---|---|
| InputLayer | - | - | - | (50999, 1) |
| Conv1D | 64 | 2 | 32 | (25468, 32) |
| Conv1D | 16 | 2 | 64 | (12727, 64) |
| MaxPooling1D | 64 | 64 | 64 | (198, 64) |
| swapaxes | - | - | - | (198, 64, 1) |
| Conv2D | (8,8) | (1,1) | 32 | (191, 57, 32) |
| Conv2D | (8,8) | (1,1) | 32 | (184, 50, 32) |
| MaxPooling2D | (5,3) | (5,3) | 32 | (36, 16, 32) |
| Conv2D | (1,4) | (1,1) | 64 | (36, 16, 64) |
| Conv2D | (1,4) | (1,1) | 64 | (36, 10, 64) |
| MaxPooling2D | (1,2) | (1,2) | 64 | (36, 5, 64) |
| Conv2D | (1,2) | (1,1) | 128 | (36, 4, 128) |
| FC | - | - | 4,096 | (4096) |
| FC | - | - | 4,096 | (4096) |
| FC | - | - | 10 | (10) |

Table 3: ENVnet-V2 architecture.

2.4 SincNet

Table 4 shows the architecture of SincNet (Ravanelli and Bengio, 2018). In this model, 80 sinc functions are used as band-pass filters that decompose the audio signal into appropriate frequency bands. After that, two one-dimensional convolutional layers with 80 and 60 filters are applied. Layer normalization (Lei Ba et al., 2016) and max-pooling are used after each convolutional layer. Two FC layers followed by a softmax layer are used for classification. Drop-out with probability 0.5 (Srivastava et al., 2014) is applied to the FC layers, and batch normalization (Ioffe and Szegedy, 2015) is used after them. All hidden layers use leaky-ReLU non-linearities (Maas et al., 2013).

| Layer | Ksize | Stride | # of filters | Data shape |
|---|---|---|---|---|
| InputLayer | - | - | - | (50999, 1) |
| SincConv1D | 251 | 1 | 80 | (50749, 80) |
| MaxPooling1D | 3 | 1 | 80 | (16916, 80) |
| Conv1D | 5 | 1 | 60 | (16912, 60) |
| MaxPooling1D | 3 | 1 | 60 | (5637, 60) |
| Conv1D | 5 | 1 | 60 | (5633, 60) |
| FC | - | - | 128 | (128) |
| FC | - | - | 64 | (64) |
| FC | - | - | 10 | (10) |

Table 4: SincNet architecture.
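A single sinc-based band-pass kernel of the kind used in the first layer can be sketched as the difference of two ideal low-pass filters, tapered by a Hamming window as in the SincNet paper. The kernel length of 251 comes from Table 4. The cut-off frequencies below (0.1 and 0.2 cycles per sample) are arbitrary illustration values rather than learned ones, and the names `sinc_bandpass` and `gain` are hypothetical.

```python
import numpy as np

def sinc_bandpass(f1, f2, length=251):
    """Band-pass kernel with cut-offs f1 < f2 given in cycles per sample."""
    n = np.arange(length) - (length - 1) / 2          # symmetric sample index
    # np.sinc(x) = sin(pi x) / (pi x), so 2 f sinc(2 f n) is an ideal low-pass at f
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return h * np.hamming(length)                     # window to smooth truncation

h = sinc_bandpass(0.1, 0.2)

# Inspect the magnitude response on a dense frequency grid
H = np.abs(np.fft.rfft(h, 4096))
freqs = np.fft.rfftfreq(4096)                         # 0 .. 0.5 cycles per sample

def gain(f):
    """Magnitude response at the grid frequency closest to f."""
    return float(H[np.argmin(np.abs(freqs - f))])

print(round(gain(0.15), 2), gain(0.05) < 0.02, gain(0.30) < 0.02)
```

Only the two cut-off frequencies parameterize each kernel, which is why such a layer needs far fewer parameters than a standard convolution with the same kernel length.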

2.5 SincNet+VGG19

Table 5 shows the specification of this architecture. This model uses 227 sinc filters to extract features from the raw audio signal, as introduced in SincNet (Ravanelli and Bengio, 2018). After a one-dimensional max-pooling layer of size 218 with a stride of one and layer normalization (Lei Ba et al., 2016), the output is stacked along the time axis to form a 2D representation. This time-frequency representation is the input to a VGG19 network (Simonyan and Zisserman, 2014). The parameters of VGG19 are the same as described in Simonyan and Zisserman (2014) and are not changed in this study. The output of VGG19 is the input to an FC softmax layer with ten neurons for classification.

| Layer | Ksize | Stride | # of filters | Data shape |
|---|---|---|---|---|
| InputLayer | - | - | - | (50999, 1) |
| SincConv1D | 251 | 1 | 227 | (50749, 1) |
| MaxPooling1D | 218 | 1 | 227 | (232, 1) |
| Reshape | - | - | - | (232, 227, 1) |
| VGG19 (Simonyan and Zisserman, 2014) | - | - | - | (7, 7, 512) |
| FC | - | - | 10 | (10) |

Table 5: SincNet+VGG19 architecture.

3 Audio examples

Several randomly chosen examples of perturbed audio samples from the UrbanSound8k dataset (Salamon et al., 2014) are also provided. The samples are perturbed with the two methods presented in this study, and both targeted and untargeted perturbations are considered. Table 6 lists the samples, giving for each one the crafting method, the target model, the class detected by that model, and the true class.

| Sample | Detected class | True class | Target model | Method | Targeted/Untargeted |
|---|---|---|---|---|---|
| JA_0_org.wav | jackhammer | jackhammer | SincNet | N/A | N/A |
| JA_0_pert_pen.wav | gun_shot | jackhammer | SincNet | penalty | targeted |
| JA_0_pert_itr.wav | gun_shot | jackhammer | SincNet | iterative | targeted |
| SI_0_org.wav | siren | siren | SincNet | N/A | N/A |
| SI_0_pert_itr.wav | car_horn | siren | SincNet | iterative | targeted |
| SI_0_pert_pen.wav | car_horn | siren | SincNet | penalty | targeted |
| ST_0_org.wav | street_music | street_music | SincNet | N/A | N/A |
| ST_0_pert_pen.wav | air_conditioner | street_music | SincNet | penalty | targeted |
| ST_0_pert_itr.wav | air_conditioner | street_music | SincNet | iterative | targeted |
| DR_0_org.wav | drilling | drilling | SincNet | N/A | N/A |
| DR_0_pert_pen.wav | siren | drilling | SincNet | penalty | targeted |
| DR_0_pert_itr.wav | siren | drilling | SincNet | iterative | targeted |
| CA_0_org.wav | car_horn | car_horn | SincNet+VGG | N/A | N/A |
| CA_0_pert_itr.wav | siren | car_horn | SincNet+VGG | iterative | targeted |
| CA_0_pert_pen.wav | siren | car_horn | SincNet+VGG | penalty | targeted |
| JA_1_org.wav | jackhammer | jackhammer | SincNet+VGG | N/A | N/A |
| JA_1_pert_itr.wav | dog_bark | jackhammer | SincNet+VGG | iterative | untargeted |
| JA_1_pert_pen.wav | children_playing | jackhammer | SincNet+VGG | penalty | untargeted |
| EN_0_org.wav | engine_idling | engine_idling | SincNet+VGG | N/A | N/A |
| EN_0_pert_itr.wav | drilling | engine_idling | SincNet+VGG | iterative | untargeted |
| EN_0_pert_pen.wav | drilling | engine_idling | SincNet+VGG | penalty | untargeted |
| CA_1_org.wav | car_horn | car_horn | SincNet+VGG | N/A | N/A |
| CA_1_pert_pen.wav | drilling | car_horn | SincNet+VGG | penalty | untargeted |
| CA_1_pert_itr.wav | drilling | car_horn | SincNet+VGG | iterative | untargeted |
| SI_1_org.wav | siren | siren | SincNet+VGG | N/A | N/A |
| SI_1_pert_itr.wav | street_music | siren | SincNet+VGG | iterative | untargeted |
| SI_1_pert_pen.wav | children_playing | siren | SincNet+VGG | penalty | untargeted |

Table 6: Examples of perturbed audio samples, with the crafting method, the target model, the class detected by that model, and the true class of each sample. The audio files belong to the UrbanSound8k dataset (Salamon et al., 2014). N/A: not applicable.

4 Detailed targeted attack results

Tables 7 to 11 report the detailed attack success rate (ASR) on the training and test sets for each target model in the targeted attack scenario. ASRs are reported for each target class of UrbanSound8k (Salamon et al., 2014), together with the mean signal-to-noise ratio (SNR) of the model inputs after the universal perturbation is added. The target classes are: air conditioner (AI), car horn (CA), children playing (CH), dog bark (DO), drilling (DR), engine idling (EN), gun shot (GU), jackhammer (JA), siren (SI) and street music (ST).

| Method | Metric (per target class) | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iterative | ASR train set | 0.943 | 0.997 | 0.953 | 0.994 | 0.996 | 0.994 | 0.988 | 0.977 | 0.990 | 0.996 |
| Iterative | ASR test set | 0.911 | 0.970 | 0.905 | 0.977 | 0.978 | 0.981 | 0.969 | 0.954 | 0.965 | 0.982 |
| Iterative | SNR (dB) test set | 14.760 | 16.520 | 15.519 | 17.839 | 16.681 | 15.735 | 18.389 | 16.165 | 15.673 | 17.006 |
| Penalty | ASR train set | 0.951 | 0.970 | 0.935 | 0.969 | 0.968 | 0.959 | 0.985 | 0.965 | 0.937 | 0.976 |
| Penalty | ASR test set | 0.953 | 0.962 | 0.918 | 0.951 | 0.967 | 0.961 | 0.981 | 0.967 | 0.926 | 0.965 |
| Penalty | SNR (dB) test set | 15.254 | 15.676 | 16.584 | 16.330 | 16.273 | 15.290 | 16.061 | 15.887 | 16.456 | 15.864 |

Table 7: ASR and mean SNR for targeting each label of the UrbanSound8k dataset (Salamon et al., 2014). The target model is 1D CNN Rand.

| Method | Metric (per target class) | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iterative | ASR train set | 0.943 | 0.997 | 0.953 | 0.994 | 0.996 | 0.994 | 0.988 | 0.977 | 0.990 | 0.996 |
| Iterative | ASR test set | 0.911 | 0.970 | 0.905 | 0.977 | 0.978 | 0.981 | 0.969 | 0.954 | 0.965 | 0.982 |
| Iterative | SNR (dB) test set | 14.760 | 16.520 | 15.519 | 17.839 | 16.681 | 15.735 | 18.389 | 16.165 | 15.673 | 17.006 |
| Penalty | ASR train set | 0.951 | 0.970 | 0.935 | 0.969 | 0.968 | 0.959 | 0.985 | 0.965 | 0.937 | 0.976 |
| Penalty | ASR test set | 0.953 | 0.962 | 0.918 | 0.951 | 0.967 | 0.961 | 0.981 | 0.967 | 0.926 | 0.965 |
| Penalty | SNR (dB) test set | 15.254 | 15.676 | 16.584 | 16.330 | 16.273 | 15.290 | 16.061 | 15.887 | 16.456 | 15.864 |

Table 8: ASR and mean SNR for targeting each label of the UrbanSound8k dataset (Salamon et al., 2014). The target model is 1D CNN Gamma.

| Method | Metric (per target class) | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iterative | ASR train set | 0.992 | 0.977 | 0.980 | 0.993 | 0.975 | 0.993 | 0.979 | 0.979 | 0.991 | 0.982 |
| Iterative | ASR test set | 0.977 | 0.960 | 0.965 | 0.971 | 0.950 | 0.969 | 0.954 | 0.963 | 0.974 | 0.937 |
| Iterative | SNR (dB) test set | 18.373 | 17.374 | 17.791 | 18.450 | 17.492 | 17.989 | 18.321 | 17.953 | 17.896 | 18.192 |
| Penalty | ASR train set | 0.964 | 0.964 | 0.975 | 0.977 | 0.981 | 0.963 | 0.977 | 0.950 | 0.990 | 0.971 |
| Penalty | ASR test set | 0.938 | 0.935 | 0.960 | 0.960 | 0.962 | 0.947 | 0.963 | 0.910 | 0.983 | 0.962 |
| Penalty | SNR (dB) test set | 18.327 | 16.645 | 18.529 | 16.135 | 15.985 | 17.291 | 15.672 | 17.257 | 16.844 | 17.219 |

Table 9: ASR and mean SNR for targeting each label of the UrbanSound8k dataset (Salamon et al., 2014). The target model is ENVnet-V2.

| Method | Metric (per target class) | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iterative | ASR train set | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Iterative | ASR test set | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.997 |
| Iterative | SNR (dB) test set | 19.559 | 17.826 | 19.687 | 19.460 | 20.144 | 19.701 | 18.283 | 19.511 | 18.884 | 20.125 |
| Penalty | ASR train set | 1.000 | 0.989 | 1.000 | 1.000 | 1.000 | 1.000 | 0.994 | 0.999 | 0.998 | 1.000 |
| Penalty | ASR test set | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 |
| Penalty | SNR (dB) test set | 17.813 | 17.404 | 18.328 | 18.187 | 17.906 | 18.103 | 17.540 | 18.343 | 17.883 | 18.379 |

Table 10: ASR and mean SNR for targeting each label of the UrbanSound8k dataset (Salamon et al., 2014). The target model is SincNet.

| Method | Metric (per target class) | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iterative | ASR train set | 0.991 | 0.998 | 0.998 | 0.998 | 0.997 | 0.952 | 0.982 | 1.000 | 0.996 | 0.994 |
| Iterative | ASR test set | 0.975 | 0.987 | 0.987 | 0.986 | 0.978 | 0.928 | 0.957 | 0.981 | 0.986 | 0.969 |
| Iterative | SNR (dB) test set | 18.354 | 19.296 | 19.297 | 19.217 | 20.755 | 17.498 | 18.048 | 19.683 | 19.096 | 19.592 |
| Penalty | ASR train set | 0.960 | 0.965 | 0.974 | 0.900 | 0.982 | 0.906 | 0.950 | 0.968 | 0.931 | 0.916 |
| Penalty | ASR test set | 0.959 | 0.961 | 0.958 | 0.896 | 0.989 | 0.903 | 0.939 | 0.961 | 0.931 | 0.913 |
| Penalty | SNR (dB) test set | 16.968 | 18.293 | 18.049 | 18.448 | 18.373 | 16.270 | 17.037 | 18.103 | 17.733 | 17.819 |

Table 11: ASR and mean SNR for targeting each label of the UrbanSound8k dataset (Salamon et al., 2014). The target model is SincNet+VGG.
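The two quantities reported in Tables 7 to 11 can be computed along the following lines. This is a minimal sketch: the exact SNR definition used for the tables is not given in this section, so the standard power-ratio form in decibels is assumed, and the targeted ASR is taken as the fraction of perturbed samples classified as the target class. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def snr_db(x, v):
    """SNR in dB between a clean signal x and an additive perturbation v."""
    return 10.0 * np.log10(np.mean(x ** 2) / np.mean(v ** 2))

def targeted_asr(predicted, target):
    """Fraction of perturbed samples that the model assigns to the target class."""
    return float(np.mean(np.asarray(predicted) == target))

# Toy signal: a perturbation with one tenth of the amplitude gives 20 dB
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 440.0 * t)
v = 0.1 * x
snr = snr_db(x, v)

# Toy predictions on four perturbed samples with target class 3
asr = targeted_asr([3, 3, 1, 3], target=3)

print(round(snr, 3), asr)  # 20.0 0.75
```

A higher SNR means a quieter perturbation relative to the signal, so the tables report how successful the attack is alongside how audible the perturbation is.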

References

  • S. Abdoli, P. Cardinal, and A. L. Koerich (2019) End-to-end environmental sound classification using a 1D convolutional neural network. arXiv preprint arXiv:1904.08990.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
  • A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
  • M. Ravanelli and Y. Bengio (2018) Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
  • J. Salamon, C. Jacoby, and J. P. Bello (2014) A dataset and taxonomy for urban sound research. In 22nd ACM International Conference on Multimedia, New York, NY, USA, pp. 1041–1044.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  • Y. Tokozume, Y. Ushiku, and T. Harada (2017) Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.