### Abstract

We demonstrate the existence of universal adversarial perturbations, which can fool a family of audio processing architectures, for both targeted and untargeted attacks. To the best of our knowledge, this is the first study on generating universal adversarial perturbations for audio processing systems. We propose two methods for finding such perturbations. The first method is based on an iterative, greedy approach that is well-known in computer vision: it aggregates small perturbations to the input so as to push it to the decision boundary. The second method, which is the main technical contribution of this work, is a novel penalty formulation, which finds targeted and untargeted universal adversarial perturbations. Differently from the greedy approach, the penalty method minimizes an appropriate objective function on a batch of samples. Therefore, it produces more successful attacks when the number of training samples is limited. Moreover, we provide a proof that the proposed penalty method theoretically converges to a solution that corresponds to universal adversarial perturbations. We report comprehensive experiments, showing attack success rates higher than 91.1% and 74.7% for targeted and untargeted attacks, respectively.


# Universal adversarial audio perturbations

Supplementary material



Submitted to 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). Do not distribute.

## 1 Proof of Theorem 1

Theorem 1: Let $\{\mathbf{v}^{k}\}$, $k=1,\dots,\infty$, be the sequence generated by the proposed penalty method. Let $\bar{\mathbf{v}}$ be the limit point of $\{\mathbf{v}^{k}\}$. Then any limit point of the sequence is a solution to the original optimization problem:

$$\begin{array}{l}\min\; dB(\mathbf{v})\\ \text{s.t. } y_{t}=\arg\max_{y}\,\mathbb{P}\left(y\,|\,\mathbf{x}_{i}+\mathbf{v},\theta\right)\\ \text{and}\\ 0\leq\mathbf{x}_{i}+\mathbf{v}\leq 1\quad\forall\,\mathbf{x}_{i}.\end{array}\tag{1}$$

Before proving Theorem 1, we state and prove a useful lemma.

Lemma 1: Let $\mathbf{v}^{*}$ be the optimal value of the original constrained problem defined in Eq. (1). Then $dB(\mathbf{v}^{*})\geq L(\mathbf{w}_{i}^{k},\mathbf{v}^{k};t)\geq dB(\mathbf{v}^{k})\;\;\forall k$.

Proof of Lemma 1:

$$\begin{array}{llll}dB(\mathbf{v}^{*}) & = & dB(\mathbf{v}^{*})+c\cdot G(\mathbf{w}_{i}^{*}) & (\because\ G(\mathbf{w}_{i}^{*})=0)\\ & \geq & dB(\mathbf{v}^{k})+c\cdot G(\mathbf{w}_{i}^{k}) & (\because\ c>0,\ G(\mathbf{w}_{i}^{k})\geq 0,\ \mathbf{w}_{i}^{k}\text{ minimizes }L(\mathbf{w}_{i}^{k},\mathbf{v}^{k};t))\\ & \geq & dB(\mathbf{v}^{k}) & \\ \therefore\ dB(\mathbf{v}^{*}) & \geq & L(\mathbf{w}_{i}^{k},\mathbf{v}^{k};t)\geq dB(\mathbf{v}^{k})\quad\forall k & \end{array}$$

Proof of Theorem 1. $dB$ is a monotonically increasing and continuous function. $G$ is a hinge function, which is also continuous. $L$ is the sum of two continuous functions and is therefore continuous as well. The limit point of $\{\mathbf{v}^{k}\}$ is defined as $\bar{\mathbf{v}}=\lim_{k\to\infty}\mathbf{v}^{k}$, and since $dB$ is continuous, $dB(\bar{\mathbf{v}})=\lim_{k\to\infty}dB(\mathbf{v}^{k})$. We can conclude that:

$$\begin{array}{llll} & L^{*} & = & \lim_{k\to\infty}L(\mathbf{w}_{i}^{k},\mathbf{v}^{k};t)\leq dB(\mathbf{v}^{*})\quad(\because\ \text{Lemma 1})\\ \Rightarrow & L^{*} & = & \lim_{k\to\infty}dB(\mathbf{v}^{k})+\lim_{k\to\infty}c\cdot G(\mathbf{w}_{i}^{k})\leq dB(\mathbf{v}^{*})\\ \Rightarrow & L^{*} & = & dB(\bar{\mathbf{v}})+\lim_{k\to\infty}c\cdot G(\mathbf{w}_{i}^{k})\leq dB(\mathbf{v}^{*})\end{array}$$

If $\mathbf{v}^{k}$ is a feasible point for the constrained optimization problem defined in Eq. (1), then, from the definition of the function $G(\cdot)$, one can conclude that $\lim_{k\to\infty}c\cdot G(\mathbf{w}_{i}^{k})=0$. Then:

$$L^{*}=dB(\bar{\mathbf{v}})\leq dB(\mathbf{v}^{*})$$

$$\boxed{\therefore\ \bar{\mathbf{v}}\text{ is a solution of the problem defined in Eq. (1)}}$$
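The feasibility condition used in the proof can be made concrete with a small check of the two constraints of Eq. (1). The following is a minimal sketch in plain Python, not the authors' implementation: `is_feasible`, `toy_predict`, and the toy data are hypothetical names and values introduced only for illustration, and `predict` stands in for the class probabilities returned by a trained model.

```python
from typing import Callable, Sequence

Signal = Sequence[float]


def is_feasible(
    v: Signal,
    samples: Sequence[Signal],
    predict: Callable[[Sequence[float]], Sequence[float]],
    target: int,
) -> bool:
    """Return True if v satisfies both constraints of Eq. (1) on every sample."""
    for x in samples:
        perturbed = [xi + vi for xi, vi in zip(x, v)]
        # Box constraint: 0 <= x_i + v <= 1 for every audio sample value.
        if any(p < 0.0 or p > 1.0 for p in perturbed):
            return False
        # Targeted constraint: the most probable class must be the target y_t.
        probs = predict(perturbed)
        if max(range(len(probs)), key=probs.__getitem__) != target:
            return False
    return True


def toy_predict(signal: Sequence[float]) -> Sequence[float]:
    """Hypothetical two-class model: class 1 wins when the mean exceeds 0.5."""
    mean = sum(signal) / len(signal)
    return [1.0 - mean, mean]


samples = [[0.2, 0.4, 0.3], [0.5, 0.1, 0.6]]
print(is_feasible([0.3, 0.3, 0.3], samples, toy_predict, target=1))  # True
print(is_feasible([0.7, 0.7, 0.7], samples, toy_predict, target=1))  # False: leaves [0, 1]
```

A single perturbation `v` is checked against all samples at once, which is what makes the perturbation universal rather than specific to one input.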

## 2 Target models

Five types of models are targeted in this study. All models are trained with categorical cross-entropy as the loss function and Adadelta (Zeiler, 2012) as the optimizer. This section gives the complete description of the models.

### 2.1 1D CNN Rand

Table 1 shows the configuration of 1D CNN Rand (Abdoli et al., 2019). This model consists of five one-dimensional convolutional layers with 16, 32, 64, 128 and 256 kernels, respectively. The kernel sizes of these layers are 64, 32, 16, 8 and 4. The first, second and fifth convolutional layers are each followed by a one-dimensional max-pooling layer, of size 8, 8 and 4, respectively. The output of the last pooling layer is fed to two fully connected (FC) layers with 128 and 64 neurons, each with drop-out applied at a probability of 0.5 (Srivastava et al., 2014). ReLU is used as the activation function for all layers. To reduce over-fitting, batch normalization (Ioffe and Szegedy, 2015) is applied after the activation function of each convolutional layer. The output of the last FC layer is the input to a softmax layer with ten neurons for classification.

Layer | Ksize | Stride | # of filters | Data shape |
---|---|---|---|---|
InputLayer | - | - | - | (50999, 1) |
Conv1D | 64 | 2 | 16 | (25468, 16) |
MaxPooling1D | 8 | 8 | 16 | (3183, 16) |
Conv1D | 32 | 2 | 32 | (1576, 32) |
MaxPooling1D | 8 | 8 | 32 | (197, 32) |
Conv1D | 16 | 2 | 64 | (91, 64) |
Conv1D | 8 | 2 | 128 | (42, 128) |
Conv1D | 4 | 2 | 256 | (20, 256) |
MaxPooling1D | 4 | 4 | 128 | (5, 256) |
FC | - | - | 128 | (128) |
FC | - | - | 64 | (64) |
FC | - | - | 10 | (10) |
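The time-axis lengths in the data-shape column of Table 1 can be reproduced from the kernel sizes and strides alone. The sketch below assumes that no padding is used in any convolutional or pooling layer; this is an assumption, although it is the setting under which the arithmetic matches the table. The helper names `out_len` and `trace_lengths` are illustrative.

```python
def out_len(n: int, ksize: int, stride: int) -> int:
    """Output length of a 1D convolution or pooling layer without padding."""
    return (n - ksize) // stride + 1


# (layer, kernel size, stride) rows copied from Table 1.
LAYERS = [
    ("Conv1D", 64, 2),
    ("MaxPooling1D", 8, 8),
    ("Conv1D", 32, 2),
    ("MaxPooling1D", 8, 8),
    ("Conv1D", 16, 2),
    ("Conv1D", 8, 2),
    ("Conv1D", 4, 2),
    ("MaxPooling1D", 4, 4),
]


def trace_lengths(n: int, layers) -> list:
    """Propagate an input length through the layers and record each output."""
    lengths = []
    for _, ksize, stride in layers:
        n = out_len(n, ksize, stride)
        lengths.append(n)
    return lengths


lengths = trace_lengths(50999, LAYERS)
for (name, _, _), n in zip(LAYERS, lengths):
    print(f"{name:<13} -> {n}")
# lengths == [25468, 3183, 1576, 197, 91, 42, 20, 5]
```

The same helper applies to the one-dimensional layers of the other models; for instance, `out_len(50999, 512, 1)` gives 50488, the length after the first layer of 1D CNN Gamma.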

### 2.2 1D CNN Gamma

This model is similar to 1D CNN Rand, except that a gammatone filter-bank is used to initialize the filters of its first layer (Abdoli et al., 2019). Table 2 shows the configuration of this model. The gammatone filters are not trained during backpropagation. Sixty-four filters decompose the input signal into appropriate frequency bands, covering the frequency range from 100 Hz to 8 kHz. Batch normalization (Ioffe and Szegedy, 2015) is also applied after this layer.

Layer | Ksize | Stride | # of filters | Data shape |
---|---|---|---|---|
InputLayer | - | - | - | (50999, 1) |
Conv1D | 512 | 1 | 64 | (50488, 64) |
MaxPooling1D | 8 | 8 | 64 | (6311, 64) |
Conv1D | 32 | 2 | 32 | (3140, 32) |
MaxPooling1D | 8 | 8 | 32 | (392, 32) |
Conv1D | 16 | 2 | 64 | (189, 64) |
Conv1D | 8 | 2 | 128 | (91, 128) |
Conv1D | 4 | 2 | 256 | (44, 256) |
MaxPooling1D | 4 | 4 | 128 | (11, 256) |
FC | - | - | 128 | (128) |
FC | - | - | 64 | (64) |
FC | - | - | 10 | (10) |

### 2.3 ENVnet-V2

Table 3 shows the architecture of ENVnet-V2 (Tokozume et al., 2017). This model extracts short-time frequency features from the audio signal using two one-dimensional convolutional layers with 32 and 64 filters, respectively, followed by a one-dimensional max-pooling layer. The model then swaps axes and convolves the features in the time and frequency domains using two two-dimensional convolutional layers, each with 32 filters, followed by a two-dimensional max-pooling layer. Two further two-dimensional convolutional layers and a max-pooling layer come next, followed by another two-dimensional convolutional layer with 128 filters. After two FC layers with 4096 neurons each, a softmax layer is applied for classification. Drop-out with a probability of 0.5 is applied to the FC layers (Srivastava et al., 2014), and ReLU is used as the activation function for all layers.

Layer | Ksize | Stride | # of filters | Data shape |
---|---|---|---|---|
InputLayer | - | - | - | (50999, 1) |
Conv1D | 64 | 2 | 32 | (25468, 32) |
Conv1D | 16 | 2 | 64 | (12727, 64) |
MaxPooling1D | 64 | 64 | 64 | (198, 64) |
swapaxes | - | - | - | (198, 64, 1) |
Conv2D | (8,8) | (1,1) | 32 | (191, 57, 32) |
Conv2D | (8,8) | (1,1) | 32 | (184, 50, 32) |
MaxPooling2D | (5,3) | (5,3) | 32 | (36, 16, 32) |
Conv2D | (1,4) | (1,1) | 64 | (36, 16, 64) |
Conv2D | (1,4) | (1,1) | 64 | (36, 10, 64) |
MaxPooling2D | (1,2) | (1,2) | 64 | (36, 5, 64) |
Conv2D | (1,2) | (1,1) | 128 | (36, 4, 128) |
FC | - | - | 4096 | (4096) |
FC | - | - | 4096 | (4096) |
FC | - | - | 10 | (10) |

### 2.4 SincNet

Table 4 shows the architecture of SincNet (Ravanelli and Bengio, 2018). In this model, 80 sinc functions are used as band-pass filters to decompose the audio signal into appropriate frequency bands. Two one-dimensional convolutional layers with 80 and 60 filters are then applied. Layer normalization (Lei Ba et al., 2016) and max-pooling are applied after each convolutional layer. Two FC layers followed by a softmax layer are used for classification. Drop-out with a probability of 0.5 is applied to the FC layers (Srivastava et al., 2014), and batch normalization (Ioffe and Szegedy, 2015) is used after them. All hidden layers use leaky-ReLU (Maas et al., 2013) non-linearities.
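To illustrate what a sinc band-pass filter looks like, the sketch below builds one kernel as the difference of two low-pass sinc filters with a Hamming window, following the parameterization described by Ravanelli and Bengio (2018). It is a dependency-free illustration, not the trained first layer: the cutoff frequencies 0.10 and 0.20 (in cycles per sample) are arbitrary assumptions, the kernel size 251 matches the Ksize of the SincConv1D layer in Table 4, and `sinc_bandpass` and `gain` are hypothetical helper names.

```python
import math


def sinc(x: float) -> float:
    """Unnormalized sinc: sin(x) / x, with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f1: float, f2: float, ksize: int = 251) -> list:
    """Band-pass kernel as the difference of two low-pass sinc filters.

    f1 < f2 are cutoff frequencies in cycles per sample (0 < f < 0.5).
    """
    half = (ksize - 1) // 2
    kernel = []
    for i in range(ksize):
        n = i - half  # centre the kernel on n = 0
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window to smooth the truncation of the ideal filter.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (ksize - 1))
        kernel.append(ideal * window)
    return kernel


def gain(kernel: list, f: float) -> float:
    """Magnitude of the kernel's frequency response at f cycles per sample."""
    re = sum(h * math.cos(2 * math.pi * f * n) for n, h in enumerate(kernel))
    im = sum(h * math.sin(2 * math.pi * f * n) for n, h in enumerate(kernel))
    return math.hypot(re, im)


kernel = sinc_bandpass(0.10, 0.20)
print(len(kernel))          # 251
print(gain(kernel, 0.15))   # close to 1 inside the pass band
print(gain(kernel, 0.35))   # close to 0 outside the pass band
```

Because each filter is fully determined by its two cutoff frequencies, only those two values per filter need to be learned, regardless of the kernel length.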

Layer | Ksize | Stride | # of filters | Data shape |
---|---|---|---|---|
InputLayer | - | - | - | (50999, 1) |
SincConv1D | 251 | 1 | 80 | (50749, 80) |
MaxPooling1D | 3 | 1 | 80 | (16916, 80) |
Conv1D | 5 | 1 | 60 | (16912, 60) |
MaxPooling1D | 3 | 1 | 60 | (5637, 60) |
Conv1D | 5 | 1 | 60 | (5633, 60) |
FC | - | - | 128 | (128) |
FC | - | - | 64 | (64) |
FC | - | - | 10 | (10) |

### 2.5 SincNet+VGG19

Table 5 shows the specification of this architecture. This model uses 227 sinc filters to extract features from the raw audio signal, as introduced in SincNet (Ravanelli and Bengio, 2018). After a one-dimensional max-pooling layer of size 218 with a stride of one, and layer normalization (Lei Ba et al., 2016), the output is stacked along the time axis to form a 2D representation. This time-frequency representation is the input to a VGG19 network (Simonyan and Zisserman, 2014). The parameters of VGG19 are the same as described in Simonyan and Zisserman (2014) and are not changed in this study. The output of VGG19 is used as the input to an FC softmax layer with ten neurons for classification.

Layer | Ksize | Stride | # of filters | Data shape |
---|---|---|---|---|
InputLayer | - | - | - | (50999, 1) |
SincConv1D | 251 | 1 | 227 | (50749, 1) |
MaxPooling1D | 218 | 1 | 227 | (232, 1) |
Reshape | - | - | - | (232, 227, 1) |
VGG19 (Simonyan and Zisserman, 2014) | - | - | - | (7, 7, 512) |
FC | - | - | 10 | (10) |

## 3 Audio examples

Several randomly chosen examples of perturbed audio samples from the UrbanSound8k dataset (Salamon et al., 2014) are also presented. The samples are perturbed using the two methods presented in this study, and both targeted and untargeted perturbations are considered. Table 6 lists the samples, along with the crafting method, the target model, the class detected by the model, and the true class of each sample.

Sample | Detected Class | True Class | Target Model | Method | Targeted/Untargeted |
---|---|---|---|---|---|
JA_0_org.wav | jackhammer | jackhammer | SincNet | N/A | N/A |
JA_0_pert_pen.wav | gun_shot | jackhammer | SincNet | penalty | targeted |
JA_0_pert_itr.wav | gun_shot | jackhammer | SincNet | iterative | targeted |
SI_0_org.wav | siren | siren | SincNet | N/A | N/A |
SI_0_pert_itr.wav | car_horn | siren | SincNet | iterative | targeted |
SI_0_pert_pen.wav | car_horn | siren | SincNet | penalty | targeted |
ST_0_org.wav | street_music | street_music | SincNet | N/A | N/A |
ST_0_pert_pen.wav | air_conditioner | street_music | SincNet | penalty | targeted |
ST_0_pert_itr.wav | air_conditioner | street_music | SincNet | iterative | targeted |
DR_0_org.wav | drilling | drilling | SincNet | N/A | N/A |
DR_0_pert_pen.wav | siren | drilling | SincNet | penalty | targeted |
DR_0_pert_itr.wav | siren | drilling | SincNet | iterative | targeted |
CA_0_org.wav | car_horn | car_horn | SincNet+VGG19 | N/A | N/A |
CA_0_pert_itr.wav | siren | car_horn | SincNet+VGG19 | iterative | targeted |
CA_0_pert_pen.wav | siren | car_horn | SincNet+VGG19 | penalty | targeted |
JA_1_org.wav | jackhammer | jackhammer | SincNet+VGG19 | N/A | N/A |
JA_1_pert_itr.wav | dog_bark | jackhammer | SincNet+VGG19 | iterative | untargeted |
JA_1_pert_pen.wav | children_playing | jackhammer | SincNet+VGG19 | penalty | untargeted |
EN_0_org.wav | engine_idling | engine_idling | SincNet+VGG19 | N/A | N/A |
EN_0_pert_itr.wav | drilling | engine_idling | SincNet+VGG19 | iterative | untargeted |
EN_0_pert_pen.wav | drilling | engine_idling | SincNet+VGG19 | penalty | untargeted |
CA_1_org.wav | car_horn | car_horn | SincNet+VGG19 | N/A | N/A |
CA_1_pert_pen.wav | drilling | car_horn | SincNet+VGG19 | penalty | untargeted |
CA_1_pert_itr.wav | drilling | car_horn | SincNet+VGG19 | iterative | untargeted |
SI_1_org.wav | siren | siren | SincNet+VGG19 | N/A | N/A |
SI_1_pert_itr.wav | street_music | siren | SincNet+VGG19 | iterative | untargeted |
SI_1_pert_pen.wav | children_playing | siren | SincNet+VGG19 | penalty | untargeted |

## 4 Detailed targeted attack results

Tables 7 to 11 show the detailed attack success rate (ASR) on the training set and the test set for the target models in the targeted attack scenario. ASRs are reported for each target class of UrbanSound8k (Salamon et al., 2014). The mean signal-to-noise ratios (SNRs) of the inputs to the models after adding the universal perturbation are also reported. The target classes are: Air conditioner (AI), Car horn (CA), Children playing (CH), Dog bark (DO), Drilling (DR), Engine idling (EN), Gun shot (GU), Jackhammer (JA), Siren (SI), Street music (ST).
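As a reading aid for the two quantities reported in these tables, the sketch below computes a targeted ASR and an SNR in dB on toy data. It assumes the standard definitions, since this section does not restate them: ASR as the fraction of perturbed samples classified as the target class, and SNR as ten times the base-10 logarithm of the ratio of signal power to perturbation power. The function names and the toy values are illustrative only.

```python
import math


def attack_success_rate(predicted, target) -> float:
    """Fraction of perturbed samples that the model assigns to the target class."""
    return sum(1 for p in predicted if p == target) / len(predicted)


def snr_db(signal, perturbation) -> float:
    """SNR in dB: 10 * log10(signal power / perturbation power)."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(v * v for v in perturbation) / len(perturbation)
    return 10.0 * math.log10(p_signal / p_noise)


def mean_snr_db(signals, perturbation) -> float:
    """Mean SNR over a set of signals sharing one universal perturbation."""
    return sum(snr_db(x, perturbation) for x in signals) / len(signals)


# Toy data: labels predicted for four perturbed samples, target class "GU".
predicted = ["GU", "GU", "JA", "GU"]
print(attack_success_rate(predicted, "GU"))  # 0.75

# A perturbation with one tenth of the signal amplitude gives about 20 dB.
signal = [0.5, -0.5, 0.5, -0.5]
v = [0.05, -0.05, 0.05, -0.05]
print(snr_db(signal, v))
```

Under these definitions a higher SNR means a quieter perturbation relative to the signal, so an attack is stronger when it reaches a high ASR at a high SNR.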

Table 7. Columns are target classes.

Method | Metric | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
---|---|---|---|---|---|---|---|---|---|---|---|
Iterative | ASR train set | 0.943 | 0.997 | 0.953 | 0.994 | 0.996 | 0.994 | 0.988 | 0.977 | 0.990 | 0.996 |
Iterative | ASR test set | 0.911 | 0.970 | 0.905 | 0.977 | 0.978 | 0.981 | 0.969 | 0.954 | 0.965 | 0.982 |
Iterative | SNR (dB) test set | 14.760 | 16.520 | 15.519 | 17.839 | 16.681 | 15.735 | 18.389 | 16.165 | 15.673 | 17.006 |
Penalty | ASR train set | 0.951 | 0.970 | 0.935 | 0.969 | 0.968 | 0.959 | 0.985 | 0.965 | 0.937 | 0.976 |
Penalty | ASR test set | 0.953 | 0.962 | 0.918 | 0.951 | 0.967 | 0.961 | 0.981 | 0.967 | 0.926 | 0.965 |
Penalty | SNR (dB) test set | 15.254 | 15.676 | 16.584 | 16.330 | 16.273 | 15.290 | 16.061 | 15.887 | 16.456 | 15.864 |

Table 8. Columns are target classes.

Method | Metric | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
---|---|---|---|---|---|---|---|---|---|---|---|
Iterative | ASR train set | 0.943 | 0.997 | 0.953 | 0.994 | 0.996 | 0.994 | 0.988 | 0.977 | 0.990 | 0.996 |
Iterative | ASR test set | 0.911 | 0.970 | 0.905 | 0.977 | 0.978 | 0.981 | 0.969 | 0.954 | 0.965 | 0.982 |
Iterative | SNR (dB) test set | 14.760 | 16.520 | 15.519 | 17.839 | 16.681 | 15.735 | 18.389 | 16.165 | 15.673 | 17.006 |
Penalty | ASR train set | 0.951 | 0.970 | 0.935 | 0.969 | 0.968 | 0.959 | 0.985 | 0.965 | 0.937 | 0.976 |
Penalty | ASR test set | 0.953 | 0.962 | 0.918 | 0.951 | 0.967 | 0.961 | 0.981 | 0.967 | 0.926 | 0.965 |
Penalty | SNR (dB) test set | 15.254 | 15.676 | 16.584 | 16.330 | 16.273 | 15.290 | 16.061 | 15.887 | 16.456 | 15.864 |

Table 9. Columns are target classes.

Method | Metric | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
---|---|---|---|---|---|---|---|---|---|---|---|
Iterative | ASR train set | 0.992 | 0.977 | 0.980 | 0.993 | 0.975 | 0.993 | 0.979 | 0.979 | 0.991 | 0.982 |
Iterative | ASR test set | 0.977 | 0.960 | 0.965 | 0.971 | 0.950 | 0.969 | 0.954 | 0.963 | 0.974 | 0.937 |
Iterative | SNR (dB) test set | 18.373 | 17.374 | 17.791 | 18.450 | 17.492 | 17.989 | 18.321 | 17.953 | 17.896 | 18.192 |
Penalty | ASR train set | 0.964 | 0.964 | 0.975 | 0.977 | 0.981 | 0.963 | 0.977 | 0.950 | 0.990 | 0.971 |
Penalty | ASR test set | 0.938 | 0.935 | 0.960 | 0.960 | 0.962 | 0.947 | 0.963 | 0.910 | 0.983 | 0.962 |
Penalty | SNR (dB) test set | 18.327 | 16.645 | 18.529 | 16.135 | 15.985 | 17.291 | 15.672 | 17.257 | 16.844 | 17.219 |

Table 10. Columns are target classes.

Method | Metric | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
---|---|---|---|---|---|---|---|---|---|---|---|
Iterative | ASR train set | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
Iterative | ASR test set | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 0.997 |
Iterative | SNR (dB) test set | 19.559 | 17.826 | 19.687 | 19.460 | 20.144 | 19.701 | 18.283 | 19.511 | 18.884 | 20.125 |
Penalty | ASR train set | 1.000 | 0.989 | 1.000 | 1.000 | 1.000 | 1.000 | 0.994 | 0.999 | 0.998 | 1.000 |
Penalty | ASR test set | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 |
Penalty | SNR (dB) test set | 17.813 | 17.404 | 18.328 | 18.187 | 17.906 | 18.103 | 17.540 | 18.343 | 17.883 | 18.379 |

Table 11. Columns are target classes.

Method | Metric | AI | CA | CH | DO | DR | EN | GU | JA | SI | ST |
---|---|---|---|---|---|---|---|---|---|---|---|
Iterative | ASR train set | 0.991 | 0.998 | 0.998 | 0.998 | 0.997 | 0.952 | 0.982 | 1.000 | 0.996 | 0.994 |
Iterative | ASR test set | 0.975 | 0.987 | 0.987 | 0.986 | 0.978 | 0.928 | 0.957 | 0.981 | 0.986 | 0.969 |
Iterative | SNR (dB) test set | 18.354 | 19.296 | 19.297 | 19.217 | 20.755 | 17.498 | 18.048 | 19.683 | 19.096 | 19.592 |
Penalty | ASR train set | 0.960 | 0.965 | 0.974 | 0.900 | 0.982 | 0.906 | 0.950 | 0.968 | 0.931 | 0.916 |
Penalty | ASR test set | 0.959 | 0.961 | 0.958 | 0.896 | 0.989 | 0.903 | 0.939 | 0.961 | 0.931 | 0.913 |
Penalty | SNR (dB) test set | 16.968 | 18.293 | 18.049 | 18.448 | 18.373 | 16.270 | 17.037 | 18.103 | 17.733 | 17.819 |

## References

- Abdoli et al. (2019). End-to-end environmental sound classification using a 1D convolutional neural network. arXiv preprint arXiv:1904.08990.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Lei Ba et al. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Maas et al. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
- Ravanelli and Bengio (2018). Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
- Salamon et al. (2014). A dataset and taxonomy for urban sound research. In 22nd ACM International Conference on Multimedia, New York, NY, USA, pp. 1041–1044.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Tokozume et al. (2017). Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282.
- Zeiler (2012). ADADELTA: an adaptive learning rate method. arXiv:1212.5701.