Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection

  • 2019-07-10 16:41:13
  • Liwei Lin, Xiangdong Wang, Hong Liu, Yueliang Qian

Abstract

Sound event detection (SED) aims to recognize the presence of sound events in a segment of audio and to detect their onsets and offsets. SED can be regarded as a supervised learning task when strong annotations (timestamps) are available during learning. However, because manually labeling data with strong annotations is costly, it becomes crucial to introduce weakly supervised learning to SED, in which only weak annotations (clip-level annotations without timestamps) are available during learning. In this paper, we approach SED as a multiple instance learning (MIL) problem and utilize a neural network framework with an embedding-level pooling module to solve it. The pooling module, which aggregates a sequence of high-level features generated by the neural network feature encoder into a single contextual feature representation, enables the model to learn with only weak annotations. We explore the self-learning ability of different pooling modules on finer information and propose a specialized decision surface (SDS) for the class-wise attention pooling (cATP) module. We analyze and explain, from the perspective of feature space, why a cATP module with SDS is better than other typical pooling modules. Because several categories co-occur in the multi-label classification task, we also propose a disentangled feature (DF) to reduce interference between categories; it optimizes the high-level feature space by disentangling it, based on class-wise identifiable information in the training set, into multiple different subspaces. Experiments show that our approach achieves state-of-the-art performance on Task 4 of the DCASE 2018 challenge.

 


Liwei Lin (1, 2), Xiangdong Wang (1), Hong Liu (1), and Yueliang Qian (1)
(1) Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
(2) University of Chinese Academy of Sciences, Beijing, China

Index Terms: sound event detection, machine learning, weakly-supervised learning, attention pooling.

I Introduction

Sound event detection (SED) is the task of detecting and recognizing individual sound sources in realistic soundscapes. It requires recognizing not only the presence of each event category in an audio clip but also the start and end boundaries of each occurrence. Annotations with such timestamps for all occurrences are termed strong annotations, while weak annotations only indicate the presence of event categories. Because large-scale strongly annotated training data is difficult to obtain, weakly supervised learning with only weakly annotated training data has become a new focus in research on SED.

Weakly supervised learning is often approached as a MIL problem [1], [2]. The excellent performance of neural networks in various fields promotes the combination of the MIL framework and neural networks for weakly supervised learning [3], [4], [5]. It is especially common in medical imaging [6], [7] and semantic segmentation [8], [9]. Tseng et al. [10] proposed a small-footprint MIL framework for multi-class audio event detection (AED), which treats segments in an audio clip as a bag of instances and utilizes global max pooling (GMP) to integrate them. In addition to GMP, global average pooling (GAP) [11], [12], noisy-or pooling [13] and attention pooling [14], [15] are also in common use.

The procedure of the MIL approach with neural networks for SED comprises three stages:

  1. Encode audio features into a high-level representation (or frame-level probabilities) by means of neural networks.

  2. Aggregate all the high-level representations (or the frame-level probabilities) into a contextual representation (or a clip-level probability) via a pooling module.

  3. Pass the contextual representation into a classifier to obtain the clip-level prediction (or take the clip-level probability directly as the clip-level prediction when the frame-level probabilities are used in stage 2).

The approaches are divided into instance-level approaches and embedding-level approaches according to whether the frame-level probabilities or the high-level representation are aggregated in the second stage above. The instance-level approaches apply the pooling layer to the frame-level probabilities to obtain the clip-level probability directly [16], [13]. The embedding-level approaches apply the pooling layer to the high-level representation generated by neural networks to obtain a contextual representation, from which the clip-level probability is then obtained [15]. As mentioned in [14] and [17], the embedding-level approaches are preferable in terms of clip-level performance, which is demonstrated on Audioset [18], [19].

I-A Our contributions

In this paper, we approach SED as a multiple instance learning (MIL) problem and utilize a neural network framework with an embedding-level class-wise attention pooling (cATP) module to solve it. We explain why a pooling module is able to learn the potential instance-level decision surface when it learns the bag-level decision surface from the high-level feature space. Unlike other pooling modules such as GMP and GAP, in which frame-level (instance-level) features and clip-level (bag-level) features share the same decision surface, cATP introduces a separate decision surface for the frame-level features. We term it the specialized decision surface (SDS) and demonstrate that it is more conducive to frame-level feature classification than the common shared decision surface mentioned above.

Furthermore, we propose a disentangled feature (DF) to help multi-class learning. Because the data set is unbalanced and multiple categories always occur together at the clip level, it is difficult for cATP to learn separate feature subspaces for those event categories that overlap heavily with other event categories, especially for those with relatively few occurrences. Therefore, by taking the category overlap information into account, we propose a disentangled feature that re-models the high-level feature space so that the feature subspace of a certain category differs from those of other categories without pre-training. The scales of these disentangled feature subspaces depend on how many clips are available that contain strong class-wise identifiable information with less interference from other categories. By virtue of introducing more class-wise prior information and reducing redundant network weights, the disentangled feature can be regarded as a regularization method that helps improve the performance of cATP-MIL frameworks.

Our experiments show that cATP with SDS and DF outperforms other pooling modules and plain cATP. A detailed analysis of the high-level feature space also supports our hypothesis.

II Related work

II-A Multiple instance learning

According to MIL, all the samples with global annotations are treated as bags of instances. If there is at least one positive instance in a bag, the bag is labeled as a positive bag. Since neural networks have been widely used as general high-level feature extractors in various tasks, applications of MIL based on neural networks typically focus on adding a pooling module to the highest layer of the network [4], [20], which aggregates the output scores of all the instances into a bag-level score so that the model is able to calculate the loss using only a weak annotation.

For multi-class classification, as shown in Figure 1, let $\mathbf{x} = \{x_1, \dots, x_T\}$ be the high-level feature sequence of the audio clip generated by the feature encoder and $\mathbf{y} = \{y_1, \dots, y_C\}$ ($y_c \in \{0, 1\}$) be the ground truth, where $C$ is the number of categories.

Consider the audio clip as a bag with several instances (frames). According to MIL, the audio clip is marked with a positive label when there is at least one positive frame in it. During learning, different pooling modules represent different strategies for associating the annotation of the audio clip with those of its frames.

Instance-level pooling modules aggregate frame-level predictions into a clip-level prediction:

$$\hat{\mathbf{P}}(\mathbf{y} \mid \mathbf{x}) = \mathrm{POOLING}\big(\hat{\mathbf{P}}(\mathbf{y} \mid x_1), \dots, \hat{\mathbf{P}}(\mathbf{y} \mid x_T)\big) \tag{1}$$

Embedding-level pooling modules aggregate the high-level feature sequence into a contextual representation $\mathbf{h}$:

$$\mathbf{h} = \mathrm{POOLING}(x_1, \dots, x_T) \tag{2}$$
$$\hat{\mathbf{P}}(\mathbf{y} \mid \mathbf{x}) = \hat{\mathbf{P}}(\mathbf{y} \mid \mathbf{h}) \tag{3}$$
Fig. 1: The comparison of embedding-level module and instance-level pooling module in MIL framework.
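
To make the difference between Equation 1 and Equations 2 and 3 concrete, the following minimal Python sketch contrasts the two pooling orders for a single category. The linear classifier `classify`, the use of max pooling on the instance-level path and average pooling on the embedding-level path, and all the numbers are assumptions chosen for illustration; they are not taken from the paper's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(feature, weights, bias):
    """Hypothetical linear classifier giving one category probability."""
    return sigmoid(sum(w * f for w, f in zip(weights, feature)) + bias)


def instance_level(frames, weights, bias):
    """Equation 1: classify every frame, then pool the probabilities."""
    frame_probs = [classify(x_t, weights, bias) for x_t in frames]
    return max(frame_probs)  # max pooling, chosen only for illustration


def embedding_level(frames, weights, bias):
    """Equations 2 and 3: pool the features into h, then classify h."""
    n_dims = len(frames[0])
    h = [sum(x_t[m] for x_t in frames) / len(frames) for m in range(n_dims)]
    return classify(h, weights, bias)  # average pooling, for illustration


# Toy clip with T = 4 frames and M = 2 feature dimensions.
frames = [[0.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
weights, bias = [2.0, 0.0], -1.0
p_instance = instance_level(frames, weights, bias)
p_embedding = embedding_level(frames, weights, bias)
```

In this toy clip the single active frame dominates the instance-level score, whereas the embedding-level score is computed from the pooled feature, so the two paths generally give different clip-level probabilities.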

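When making predictions, let $\alpha$ be a threshold for both clip-level and frame-level prediction. Then the clip-level prediction for event category $c$ is:

$$\phi_c(\mathbf{x}) = \begin{cases} y_c, & \hat{\mathbf{P}}(y_c \mid \mathbf{x}) \geq \alpha \\ 1 - y_c, & \text{otherwise} \end{cases} \tag{4}$$

The frame-level prediction for event $c$ at time $t$ is:

$$\varphi_c(\mathbf{x}, t) = \begin{cases} y_c, & \hat{\mathbf{P}}(y_c \mid x_t) \cdot \phi_c(\mathbf{x}) \geq \alpha \\ 1 - y_c, & \text{otherwise} \end{cases} \tag{5}$$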

As discussed in Section I, embedding-level MIL approaches are preferable in terms of clip-level performance, so we only discuss embedding-level MIL in the remaining sections.
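
The decision rules of Equations 4 and 5 can be sketched as below. The sketch reads each prediction as a 0/1 decision (1 when the predicted probability of the event reaches the threshold), which is an interpretation rather than the paper's notation; the function names are hypothetical and the default threshold is only an example value.

```python
def clip_prediction(clip_prob, alpha=0.5):
    """Clip-level decision: 1 if the clip probability reaches alpha."""
    return 1 if clip_prob >= alpha else 0


def frame_predictions(frame_probs, clip_prob, alpha=0.5):
    """Frame-level decisions, gated by the clip-level decision."""
    gate = clip_prediction(clip_prob, alpha)
    return [1 if p * gate >= alpha else 0 for p in frame_probs]


frame_probs = [0.1, 0.7, 0.9, 0.4]
active_clip = frame_predictions(frame_probs, clip_prob=0.8)
silent_clip = frame_predictions(frame_probs, clip_prob=0.3)
```

Because the frame-level probability is multiplied by the clip-level decision, no frame can be marked positive in a clip that is predicted to be negative.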

II-B Pooling modules

In this section, we introduce three typical pooling modules for embedding-level MIL: global max pooling (GMP), global average pooling (GAP) and global softmax pooling (GSP).

For GMP and GAP, let the contextual representation $\mathbf{h} = \{h_1, \dots, h_M\}$ be an $M$-dimensional vector. Then the $m$-th component of $\mathbf{h}$ for GMP is

$$h_m = \max_t x_{tm} \tag{6}$$

and the $m$-th component of $\mathbf{h}$ for GAP is

$$h_m = \frac{1}{T} \sum_t x_{tm} \tag{7}$$
Fig. 2: A sketch of how $\mathbf{x}$ and $\mathbf{h}$ form for (a) GMP, (b) GAP, (c) GSP and (d) cATP. Red circles represent positive instances, blue circles represent negative instances and the green one represents the contextual representation. In 2(c) and 2(d), when selecting instances to update, we just ignore the ones with relatively small weights.

Both GMP and GAP generate a contextual representation that takes the information of the whole sequence into account. However, neither of them accounts for the fact that $\mathbf{x}$ contributes differently to the contextual representation $\mathbf{h}$ at different times and for different categories.
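
As a small illustration of Equations 6 and 7, the following Python sketch computes both contextual representations for a toy feature sequence. The function names and the example values are assumptions made for this sketch.

```python
def global_max_pooling(frames):
    """Equation 6: component-wise maximum over time."""
    return [max(x_t[m] for x_t in frames) for m in range(len(frames[0]))]


def global_average_pooling(frames):
    """Equation 7: component-wise mean over time."""
    n_frames = len(frames)
    return [sum(x_t[m] for x_t in frames) / n_frames
            for m in range(len(frames[0]))]


# Toy sequence with T = 3 frames and M = 2 feature dimensions.
frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
h_gmp = global_max_pooling(frames)
h_gap = global_average_pooling(frames)
```

Both functions apply the same rule to every component regardless of the category, which is the limitation noted above.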

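GSP fixes this defect by generating $C$ different contextual representations $\mathbf{h} = \{h_1, \dots, h_C\}$, with

$$h_c = \sum_t a_{ct} x_t \tag{8}$$
$$a_{ct} = \frac{\exp\big(\psi(\hat{\mathbf{P}}(y_c \mid x_t))\big)}{\sum_k \exp\big(\psi(\hat{\mathbf{P}}(y_c \mid x_k))\big)} \tag{9}$$

where $\psi$ is a function that scales $\hat{\mathbf{P}}(y_c \mid x_t)$ appropriately.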

Therefore, GSP attempts to separate clip-level embeddings of C different categories to C different subspaces, enabling the model to be more flexible in self-learning.
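
Equations 8 and 9 can be sketched in Python as follows. The scaling function $\psi$ is not specified further in the text, so the sketch assumes a simple multiplication by a constant `scale`; the function names and the toy inputs are likewise hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_softmax_pooling(frames, frame_probs, scale=1.0):
    """frame_probs[c][t] is the frame-level probability of category c.

    psi(p) = scale * p is an assumption made for this sketch.
    """
    n_dims = len(frames[0])
    contextual = []
    for probs_c in frame_probs:
        weights = softmax([scale * p for p in probs_c])       # Equation 9
        h_c = [sum(a * x_t[m] for a, x_t in zip(weights, frames))
               for m in range(n_dims)]                        # Equation 8
        contextual.append(h_c)
    return contextual


frames = [[1.0, 0.0], [0.0, 1.0]]
frame_probs = [[0.5, 0.5], [0.9, 0.1]]  # two categories, two frames
h = global_softmax_pooling(frames, frame_probs)
```

Each category obtains its own weighted average of the same frames, so the two contextual representations in the example differ even though they are built from one feature sequence.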

III Methods

III-A Specialized decision surface

According to Equation 5, all the pooling modules introduced in Section II-B implicitly assume that the model not only learns the decision surface for $\mathbf{h}$ explicitly but also learns a potential decision surface for $\mathbf{x}$, and that this potential decision surface gradually approaches the decision surface of $\mathbf{h}$.

As shown in Figure 2, depicting $\mathbf{x}$ and $\mathbf{h}$ in the same feature space shows intuitively how each pooling module performs weakly supervised learning. For example, GMP tends to update the boundaries of $\mathbf{x}$, while GAP tends to update all of $\mathbf{x}$ in a positive clip toward the positive decision surface, ignoring the mistakes caused by the negative frames in the clip. The mistakes made by GAP can be eased when $\mathbf{x}$ in a negative clip is updated. Similarly, according to Equation 9, GSP tends to update those frames that fall on (or near) one side of the decision surface in the same direction. This is a tradeoff between bulk updates and fewer mistakes, as well as a tradeoff between GMP and GAP, which is why GSP is expected to perform better than both. At the same time, the feature space formed by these update strategies ultimately makes $\mathbf{h}$ and $\mathbf{x}$ share the same decision surface, although this shared decision surface appears to be a mismatch when it comes to classifying the latter.

On the basis of these observations, we propose an explanation for the superiority of the class-wise attention pooling (cATP) module.

Similarly to GSP, cATP employs weighting factors to express how important $x_t$ is to $h_c$ for each category $c$, but the weighting factor $a_{ct}$ depends on a trainable vector $w_c$ and a bias $b_c$ instead of $\hat{\mathbf{P}}(y_c \mid x_t)$:

$$a_{ct} = \frac{\exp\big((w_c^{\top} x_t + b_c)/d\big)}{\sum_k \exp\big((w_c^{\top} x_k + b_c)/d\big)} \tag{10}$$

where $d$ is a scaling factor that prevents the dot products from growing so large in magnitude that they push the softmax function into regions where it has extremely small gradients.
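
The attention weights of Equation 10 and the resulting contextual representation can be sketched as below. This is a minimal pure-Python illustration for one category, not the authors' implementation; the function names and the toy parameters are assumptions.

```python
import math


def attention_weights(frames, w_c, b_c, d):
    """Equation 10: softmax over time of the scaled scores."""
    scores = [(sum(w * v for w, v in zip(w_c, x_t)) + b_c) / d
              for x_t in frames]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def catp(frames, w_c, b_c, d):
    """Weighted sum of the frames for one category c."""
    weights = attention_weights(frames, w_c, b_c, d)
    h_c = [sum(a * x_t[m] for a, x_t in zip(weights, frames))
           for m in range(len(frames[0]))]
    return h_c, weights


frames = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
h_c, weights = catp(frames, w_c=[1.0, 0.0], b_c=0.0, d=2.0)
shifted = attention_weights(frames, [1.0, 0.0], 5.0, 2.0)
```

One property worth noting is that the same bias is added to every frame, so it cancels inside the softmax: `shifted` equals `weights` in the example. The bias therefore only affects the result when the same parameters are used outside the softmax.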

We argue that the free parameters $w_c$ and $b_c$ determine a specialized decision surface (SDS) that selects the important frames to update. $\mathbf{x}$ is grouped into two clusters in its feature space, and SDS is exactly the decision surface formed in this unsupervised process.

As shown in Figure 2(d), compared with the shared decision surface of GSP, the SDS of cATP allows a more flexible selection of positive (or negative) frames. As a result, fewer mistakes are made when updating, leading to better classification performance.

We also note that SDS is not only better at determining the importance of frames but also more suitable as the decision surface of $\mathbf{x}$ than the shared decision surface used in MIL.

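From this point on, the frame-level prediction for event $c$ at time $t$ is:

$$\varphi_c(\mathbf{x}, t) = \begin{cases} y_c, & \hat{p}(y_c \mid x_t) \cdot \phi_c(\mathbf{x}) \geq \alpha \\ 1 - y_c, & \text{otherwise} \end{cases} \tag{11}$$
$$\hat{p}(y_c \mid x_t) = \sigma(w_c^{\top} x_t + b_c) \tag{12}$$

where $\sigma$ is the sigmoid function.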
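
A minimal sketch of frame-level prediction with SDS, following Equations 11 and 12, is given below. It assumes that the attention parameters of one category are available as plain Python lists and reads the prediction as a 0/1 decision; the function names, the parameter values and the threshold are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sds_frame_probs(frames, w_c, b_c):
    """Equation 12: frame probabilities from the attention parameters."""
    return [sigmoid(sum(w * v for w, v in zip(w_c, x_t)) + b_c)
            for x_t in frames]


def sds_frame_predictions(frames, w_c, b_c, clip_positive, alpha=0.5):
    """Equation 11: frame decisions gated by the clip-level decision."""
    gate = 1 if clip_positive else 0
    return [1 if p * gate >= alpha else 0
            for p in sds_frame_probs(frames, w_c, b_c)]


frames = [[2.0], [-2.0], [0.5]]
positive_clip = sds_frame_predictions(frames, [1.0], 0.0, clip_positive=True)
negative_clip = sds_frame_predictions(frames, [1.0], 0.0, clip_positive=False)
```

The only change relative to the shared decision surface is the source of the frame probability: it comes from the attention parameters rather than from the clip-level classifier.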

III-B Disentangled feature

cATP with SDS explicitly separates $\mathbf{h}$ and $\mathbf{x}$ into two subspaces: the feature distribution and the decision surface are learned with supervision in the former subspace and without supervision in the latter. In multi-label classification, the fact that a certain category always co-occurs with other categories makes it difficult to differentiate the features of this category, which are distributed in the former subspace, from those of other categories. This effect is exacerbated when the unbalanced training set contains particularly few clips with much identifiable information about certain categories.

To mitigate this effect, we propose a disentangled feature (DF) to re-model multiple feature subspaces for 𝐱 by selecting specific bases for the feature space of each event category. Since 𝐡 is produced by 𝐱 according to Equation 8, the feature space of 𝐡 is also re-modeled into C feature subspaces.

Assume that $\chi^{d}$ ($\mathbf{x} \in \chi^{d}$) is a $d$-dimensional space generated by the feature encoder and that $\mathcal{B} = \{e_1, e_2, \dots, e_d\}$ is a basis of $\chi^{d}$. We define $\chi_c$, a subspace of $\chi^{d}$, as the feature space of event category $c$; then the basis of $\chi_c$ is

$$\mathcal{B}_c = \{e_1, e_2, \dots, e_{k_c}\} \tag{13}$$

where $\mathbf{k} = \{k_1, k_2, \dots, k_C\}$ ($0 < k_c \leq d$) depends on how many clips containing less interference are available during training.

In this way, the diversity of the elements of $\mathbf{k}$ remodels the feature space of each category into a disentangled feature space that differs from those of the other categories. The larger the absolute difference between the $k_c$ of two categories, the more their feature spaces differ. The difference between feature spaces results in diverse decision surfaces among different categories without pre-training. In the extreme case $k_1 = k_2 = \dots = k_C = d$, all the subspaces are equal to $\chi^{d}$, so the disentangled feature degenerates into the general feature.

Fig. 3: The comparison of general feature and disentangled feature.

Meanwhile, we argue that for category $c$, the larger the proportion of clips containing less interference from other event categories, the more class-wise identifiable information needs to be learned, which requires a larger volume of the feature space. Conversely, the smaller the proportion of these clips, the smaller the volume of the feature space required to prevent overfitting. For this reason, $k_c$ increases as the proportion of these clips of category $c$ increases.

Because a $k_c$ that is too small severely limits the ability of the model to recognize category $c$, we introduce a constant factor $m$ to counter this effect:

$$k_c = \big((1 - m) f_c + m\big)\, d \tag{14}$$

where $f_c$ relates to the number of clips containing less interference in the training set. As $m$ increases to 1, the disentangled feature degrades into the general feature. In our experiments, we set $m = 0$.

We quantify the level of interference according to the principle that the more categories a clip covers, the more interference the other categories cause to any one of them:

$$f_c = \frac{\sum_{i=1}^{C} r_i N_{ci}}{R} \tag{15}$$
$$R = \max_{c} \sum_{i=1}^{C} r_i N_{ci} \tag{16}$$

Here, $N_{ci}$ denotes the number of clips in the training set that contain $i$ categories including category $c$, and $r_i$ is the corresponding constant coefficient expressing the importance of these clips. If

$$r_i = 1 \quad (1 \leq i \leq C) \tag{17}$$

$f_c$ only relates to the number of clips containing category $c$ in the training set. We argue that the less interference the other categories cause to any one of them in a clip, the more important the clip is, so we determine $r_i$ as:

$$r_i = \frac{1}{i} \quad (1 \leq i \leq C) \tag{18}$$

We can also consider only those clips containing the least interference:

$$r_i = \begin{cases} 1, & i = 1 \\ 0, & \text{otherwise} \end{cases} \tag{19}$$
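
The computation of $f_c$ and $k_c$ under the three coefficient choices can be sketched as follows. The clip counts, the category names and the value of `d` are invented for illustration; the normalizer is taken as the largest weighted count over all categories, and rounding $k_c$ to the nearest integer (with a minimum of 1) is an assumption, since the text does not state how the dimension is made integral.

```python
def coefficient_dfn(i):
    return 1.0                      # Equation 17


def coefficient_dfw(i):
    return 1.0 / i                  # Equation 18


def coefficient_df1(i):
    return 1.0 if i == 1 else 0.0   # Equation 19


def interference_fractions(counts, coefficient):
    """counts[c][i - 1] is the number of clips with i categories incl. c."""
    scores = {}
    for category, row in counts.items():
        scores[category] = sum(coefficient(i) * n
                               for i, n in enumerate(row, start=1))
    normalizer = max(scores.values())   # largest weighted count
    return {category: score / normalizer
            for category, score in scores.items()}


def df_dimension(f_c, d, m=0.0):
    """Equation 14, rounded to an integer dimension (assumption)."""
    return max(1, round(((1.0 - m) * f_c + m) * d))


# Hypothetical training set with two categories and at most two per clip.
counts = {"A": [100, 50], "B": [10, 90]}
d = 160
dims = {}
for name, coefficient in [("DF1", coefficient_df1),
                          ("DFW", coefficient_dfw),
                          ("DFN", coefficient_dfn)]:
    fractions = interference_fractions(counts, coefficient)
    dims[name] = {c: df_dimension(f, d) for c, f in fractions.items()}
```

In this toy example category B mostly appears together with A, so its subspace shrinks most under the DF1 coefficients, which count only single-category clips.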

To simplify training, we take for $\chi^{d}$ an orthogonal basis $\mathcal{B}' = \{e_1, e_2, \dots, e_d\}$ in which the $i$-th component of $e_i$ is 1. The $k_c$ basis vectors then correspond to $k_c$ dimensions of $x_t$. As shown in Figure 3, we easily obtain a ladder-shaped group of disentangled feature maps from the feature encoder for a clip.
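
With such a basis, the disentangled feature of a category amounts to keeping the first $k_c$ dimensions of every frame. The sketch below combines this selection with attention pooling in the style of Equation 10. The category names, the dimensions and the all-zero attention parameters are hypothetical and serve only to keep the example easy to check by hand.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def disentangled_frames(frames, k_c):
    """Keep the first k_c dimensions of every frame."""
    return [x_t[:k_c] for x_t in frames]


def catp_on_subspace(frames, k_c, w_c, b_c, d):
    """Attention pooling applied to the k_c-dimensional features."""
    x_c = disentangled_frames(frames, k_c)
    scores = [(sum(w * v for w, v in zip(w_c, x_ct)) + b_c) / d
              for x_ct in x_c]
    weights = softmax(scores)
    return [sum(a * x_ct[m] for a, x_ct in zip(weights, x_c))
            for m in range(k_c)]


frames = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
dims = {"speech": 4, "frying": 2}   # hypothetical k_c values
contextual = {name: catp_on_subspace(frames, k_c, [0.0] * k_c, 0.0, d=4.0)
              for name, k_c in dims.items()}
```

Categories with a small $k_c$ see only the leading dimensions, while categories with a larger $k_c$ see those dimensions plus additional ones, which produces the ladder shape mentioned above.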

Fig. 4: The number of clips in which two categories co-occur.

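Combining the disentangled feature $\mathbf{x}_c = \{x_{c1}, x_{c2}, \dots, x_{cT}\}$ with cATP to generate the contextual representation of event category $c$, we have

$$\mathbf{P}(\mathbf{y} \mid \mathbf{x}) = \mathbf{P}(\mathbf{y} \mid \mathbf{x}_c) = \mathbf{P}(\mathbf{y} \mid \mathbf{h}) \tag{20}$$
$$h_c = \sum_t a_{ct} x_{ct} \tag{21}$$
$$a_{ct} = \frac{\exp\big((w_c^{\top} x_{ct} + b_c)/d\big)}{\sum_k \exp\big((w_c^{\top} x_{ck} + b_c)/d\big)} \tag{22}$$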

IV Experiments

In this section, we introduce the dataset and describe in detail the model architecture, the pre-processing and post-processing methods, the training configuration, and the evaluation measure used in our experiments.

IV-A Dataset

We use the dataset from Task 4 of the DCASE 2018 Challenge [21], which is a subset of Audioset [18] by Google. The set contains 1578 weakly labeled clips (2244 class occurrences) for which weak annotations have been verified and cross-checked, 14412 unlabeled in-domain clips, 39999 unlabeled out-of-domain clips and 1168 clips with strong annotations. The challenge divides the strongly labeled clips into two subsets: a validation set (288 clips) and an evaluation set (880 clips). In our experiments, we use the weakly labeled data to pre-train a clip-level classification model, tag the unlabeled in-domain data with weak annotations using this model, and discard 1001 clips with empty annotations. Consequently, the training set in our experiments comprises 14989 clips with noisy weak annotations; it is large in scale and unbalanced in distribution, as shown in Figure 4.

IV-B Model architecture

As shown in Figure 5, the model architecture employed in our experiments comprises three modules: the feature encoder, the pooling module, and the classifier. The feature encoder consists of 3 convolutional blocks, each of which comprises a convolutional layer, a batch normalization [22] layer, a max pooling layer (no temporal pooling), and an activation layer. The pooling modules, including GAP, GMP, GSP and cATP, are described in detail in Sections II-B and III-A. We use a 1×1 convolutional layer with a sigmoid activation function as the classifier.

Unlike the other pooling modules, cATP-SDS makes frame-level predictions in the prediction phase according to Equation 11 and Equation 12, discussed in Section III-A.

As for cATP-SDS-DF, we experimented with the three methods of determining the constant coefficient $r_i$ discussed in Section III-B: cATP-SDS-DFN (Equation 17), cATP-SDS-DFW (Equation 18) and cATP-SDS-DF1 (Equation 19). Figure 6 illustrates the case of cATP-SDS-DF1. The disentangled dimensions of each category for these three methods are shown in Table I.

Fig. 5: Model architecture
Fig. 6: cATP-SDS-DF1
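TABLE I: The DF dimension and the median filter window size per category.

| Event | DF dimension (DF1) | DF dimension (DFW) | DF dimension (DFN) | Window size (frames) |
|---|---|---|---|---|
| Alarm bell ringing | 46 | 31 | 42 | 17 |
| Blender | 22 | 22 | 27 | 42 |
| Cat | 92 | 43 | 67 | 17 |
| Dishes | 42 | 66 | 66 | 9 |
| Dog | 82 | 39 | 60 | 16 |
| Electric shaver toothbrush | 17 | 16 | 19 | 74 |
| Frying | 13 | 41 | 35 | 85 |
| Running water | 160 | 75 | 116 | 64 |
| Speech | 74 | 160 | 160 | 18 |
| Vacuum cleaner | 85 | 35 | 57 | 87 |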

IV-C Pre-processing and post-processing

The input to the feature encoder consists of 64 log mel-band magnitudes, which are extracted from 40 ms frames with 50% overlap (nFFT=2048) using the librosa package [23]. Each 10-second audio clip is converted into a feature sequence of 500 frames. The threshold on the predicted probability that determines whether an event category exists in a clip is 0.5. For frame-level prediction, all the probabilities are smoothed by a median filter with a group of adaptive window sizes. The smoothing operation is then repeated on the final frame-level prediction.

The adaptive window size of the median filter for category $c$ is:

$$\mathrm{win}_c = \mathrm{duration}_c \cdot \beta \tag{23}$$

where $\mathrm{duration}_c$ is the average duration of category $c$ in the training set. In addition, we set β=13 and show the specific window sizes in Table I.
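
The post-processing step can be sketched as a median filter whose window follows Equation 23. The sketch below is a plain-Python illustration, not the authors' code: the handling of clip edges (truncating the window), the rounding of the window size, and the example duration and `beta` values are all assumptions.

```python
from statistics import median


def adaptive_window(avg_duration_frames, beta):
    """Equation 23, rounded to a whole number of frames (assumption)."""
    return max(1, round(avg_duration_frames * beta))


def median_filter(probs, window):
    """Median smoothing; the window is truncated at the clip edges."""
    smoothed = []
    for t in range(len(probs)):
        start = t - window // 2
        segment = probs[max(0, start):min(len(probs), start + window)]
        smoothed.append(median(segment))
    return smoothed


# Hypothetical category: average duration of 6 frames and beta = 0.5.
window = adaptive_window(6.0, 0.5)
spike = median_filter([0.0, 0.0, 1.0, 0.0, 0.0], window)
plateau = median_filter([0.0, 1.0, 1.0, 1.0, 0.0], window)
```

A window tied to the typical event duration removes isolated spikes for long events while leaving short events a chance to survive the smoothing.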

IV-D Training and evaluation

The neural networks are trained using the Adam optimizer [24] with a learning rate of 0.0018 and a mini-batch size of 64 10-second patches. The learning rate is reduced by 20% every 10 epochs. We take binary cross entropy as the loss function. Training stops if the clip-level macro F1 on the validation set does not improve within 10 epochs, and the model that performs best on the validation set up to that point is retained for prediction. All the experiments are repeated 20 times under the same parameter configuration, and we take the average of all the results as the final result. In addition, in order to compare with the first-place system in the challenge, we report the best results among these 20 experiments. Event-based measures [25] with a 200 ms collar on onsets and a collar of 200 ms / 20% of the event's length on offsets are calculated over the entire test set. The implementation of our methods is available online at https://github.com/Kikyo-16/Sound_event_detection.
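
The learning-rate schedule and the stopping rule described above can be sketched as follows. The sketch assumes a step-wise decay (the rate is multiplied by 0.8 after every block of 10 epochs) and a patience counter measured from the best epoch; both are one possible reading of the text, and the function names and the validation history are hypothetical.

```python
def learning_rate(epoch, base_lr=0.0018, decay=0.8, step=10):
    """Step decay: reduce the rate by 20% every `step` epochs."""
    return base_lr * decay ** (epoch // step)


def early_stopping_epoch(val_f1_history, patience=10):
    """Return (epoch at which training stops, epoch of the best model)."""
    best_f1, best_epoch = float("-inf"), -1
    for epoch, f1 in enumerate(val_f1_history):
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_f1_history) - 1, best_epoch


# Hypothetical clip-level macro F1 on the validation set per epoch.
history = [0.50, 0.60] + [0.55] * 12
stop_epoch, best_epoch = early_stopping_epoch(history)
```

In this example the best model is found at epoch 1 and training stops at epoch 11, after 10 epochs without improvement.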

TABLE II: The average performance of models (* means cATP-SDS).

| Model | Event detection (frame-level) F1 | Event detection (frame-level) P | Event detection (frame-level) R | Audio tagging (clip-level) F1 | Audio tagging (clip-level) P | Audio tagging (clip-level) R |
|---|---|---|---|---|---|---|
| GMP | 0.229±0.032 | 0.233±0.048 | 0.239±0.041 | 0.624±0.021 | 0.666±0.031 | 0.613±0.036 |
| GAP | 0.239±0.019 | 0.241±0.025 | 0.266±0.019 | 0.606±0.025 | 0.652±0.028 | 0.604±0.029 |
| GSP | 0.244±0.021 | 0.241±0.028 | 0.272±0.027 | 0.608±0.041 | 0.654±0.021 | 0.602±0.036 |
| cATP | 0.252±0.018 | 0.246±0.022 | 0.298±0.030 | 0.628±0.021 | 0.669±0.028 | 0.640±0.031 |
| cATP-SDS | 0.354±0.011 | 0.369±0.028 | 0.374±0.018 | 0.628±0.021 | 0.669±0.028 | 0.640±0.031 |
| *-DF1 | 0.364±0.034 | 0.378±0.029 | 0.377±0.034 | 0.638±0.022 | 0.682±0.030 | 0.623±0.031 |
| *-DFW | 0.362±0.026 | 0.368±0.023 | 0.378±0.023 | 0.648±0.037 | 0.693±0.022 | 0.630±0.031 |
| *-DFN | 0.346±0.026 | 0.356±0.038 | 0.360±0.024 | 0.639±0.022 | 0.677±0.022 | 0.637±0.038 |

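TABLE III: The best performance of models.

| Model | Event detection F1 | Event detection P | Event detection R | Audio tagging F1 | Audio tagging P | Audio tagging R |
|---|---|---|---|---|---|---|
| GMP | 0.262 | 0.281 | 0.250 | 0.645 | 0.685 | 0.629 |
| GAP | 0.251 | 0.247 | 0.253 | 0.630 | 0.667 | 0.613 |
| GSP | 0.265 | 0.268 | 0.245 | 0.629 | 0.658 | 0.625 |
| cATP | 0.270 | 0.268 | 0.325 | 0.644 | 0.688 | 0.645 |
| cATP-SDS | 0.364 | 0.387 | 0.378 | 0.644 | 0.688 | 0.645 |
| *-DF1 | 0.385 | 0.390 | 0.373 | 0.660 | 0.702 | 0.644 |
| *-DFW | 0.382 | 0.390 | 0.368 | 0.665 | 0.694 | 0.642 |
| *-DFN | 0.367 | 0.395 | 0.354 | 0.652 | 0.676 | 0.642 |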

V Discussion

In this section, we report the results of our experiments and analyze in detail the distribution of test set data in the high-level feature space of models to prove our conjecture.

V-A Results

As shown in Table II, cATP-SDS-DF1 achieves the best average frame-level F1 score among all the models, 0.364. The best run of cATP-SDS-DF1, shown in Table III, reaches 0.385, which is 6.1 percentage points above the first-place system [26] in the challenge. cATP-SDS-DFW achieves the best clip-level F1 score among all the models, 0.665. We illustrate the results of all 20 experiments in Figure 7. Since the window sizes of the median filters in post-processing have a great impact on the results, we show the performance of all the models both when the window size is fixed at 27 and when the adaptive window sizes use different values of β (β=1, β=2 and β=3).

Fig. 7: The frame-level F1 score of all 20 experiments of all the models with different window size of median filters.
Fig. 8: The comparison of frame-level probabilities output by the classifier of the model with the ground truth.
Fig. 9: The decision surfaces for different categories in the feature space generated from the feature encoder (PCA).
Fig. 10: The comparison of frame-level probabilities output by the classifier of the model with the ground truth.

V-B The performances of different pooling modules

Comparing the performance of cATP with that of the other pooling modules in Table II, cATP is dominant in both event detection and audio tagging. We note that among GMP, GAP, and GSP, GMP performs best on audio tagging but worst on event detection. This is because GMP makes a clip-level decision mainly depending on the boundaries of the high-level feature sequence, which guarantees a more reliable prediction. However, this updating strategy also leads to less reliable predictions for those frames that lie far from the boundaries of the sequence cluster. GAP raises performance on event detection by updating all the frames in a clip. But when making a clip-level prediction, the negative frames in a positive clip interfere to a large extent with the model's ability to make a correct decision. For example, as shown in Figure 8, GMP ignores those frames far from the boundaries of the sequence cluster and makes extremely discontinuous predictions for both "Blender" and "Speech", while negative frames mislead GAP and GSP into making a false prediction for the category "Blender". cATP achieves a better tradeoff between the two conditions above.

V-C The effect of SDS on event detection boundary

When we take SDS as the decision surface for event detection, the frame-level performance of cATP improves by 10.2 percentage points. To explain this phenomenon, we project the high-level feature sequences generated by the feature encoder into a two-dimensional space using principal component analysis (PCA) for observation.

As shown in Figure 9, to highlight the frame-level decision surface, we only draw the frames of clips that are predicted to be positive. Since yellow points represent frames predicted to be positive and purple points represent frames predicted to be negative, we can see intuitively that SDS clearly matches the potential decision surface, which forms without supervision and divides the frames into two clusters. The shared decision surface, however, fails to do so, which leads to the weak performance of the models with GMP, GAP, GSP and cATP on event detection. This observation matches exactly what we expect in Section III-A. As shown in Figure 10, we can intuitively see the advantage of SDS in event boundary detection.

Fig. 11: The class-wise performances of F1 on audio tagging of different models per category.
Fig. 12: The class-wise performances of F1 on event detection of different models per category.

V-D The effect of DF on multi-class classification

As shown in Figure 4, "Dishes" and "Frying" always co-occur with each other and make up a relatively small proportion of the training set, while "Speech" always co-occurs with the other categories. "Running water" also always co-occurs with "Dishes". When we focus on these four categories, we find that cATP-SDS-DF1 and cATP-SDS-DFW do improve the class-wise performance of "Dishes", "Frying" and "Running water" on both event detection and audio tagging, as shown in Figure 11 and Figure 12.

We argue that the high rate of co-occurrence between these categories and the small number of samples imply less identifiable information for each category, which makes it more difficult to learn good contextual representations in the high-level feature space.

However, when we prepare a specific subspace for each category using DF, the volume of these subspaces is greatly reduced, making it easier to fit a small amount of data containing identifiable information. As shown in Figure 13, these subspaces can easily be distinguished from each other without pre-training, which strengthens the model's resistance to interference between categories. As shown in Figure 14, the clusters of the contextual representations with DF are more compact than those without, especially for "Frying".

In addition, the weaker performance of cATP-SDS-DFN which concentrates more on the scale of data samples also demonstrates the power of the prior information about co-occurrence in the unbalanced dataset.

Fig. 13: The contextual representations predicted to be positive of different categories of the test set (PCA).

VI Conclusion

In this paper, we introduce a specialized decision surface (SDS) and a disentangled feature (DF) for weakly-supervised polyphonic sound event detection. Firstly, we approach the task as a MIL problem and introduce a MIL framework with neural networks and a pooling module. This framework is common in some weakly-supervised tasks, and to give the reader a sense of how it works on SED tasks, we group it into two broad categories: MIL with instance-level and with embedding-level pooling modules. Since embedding-level MIL approaches are preferable in terms of clip-level performance, we explore how different pooling modules work, based on which we are able to explain the superiority of the class-wise attention pooling module. The exploration of the high-level feature space generated by the neural network feature encoder leads to the discovery of an unsupervised potential decision surface, which we term the specialized decision surface (SDS). This decision surface explains the power of the class-wise attention pooling module and provides a better decision surface for event detection than the conventional shared surface in MIL. Solving MIL problems with an attention pooling module is not new, but to the best of our knowledge, we are the first to explain, from the perspective of the high-level feature space and its potential decision surface, why attention pooling brings better frame-level (instance-level) prediction. This explanation will hopefully help further work to improve the potential decision surface in attention pooling modules for weakly-supervised learning.

Fig. 14: The clusters of contextual representations of two different categories (PCA). Orange points represent positive clips while blue points represent negative clips per category.

Secondly, to tackle the common problems caused by co-occurrence between categories and data imbalance in the multi-label task, we propose a disentangled feature, which determines specific subspaces for different categories without pre-training, according to the prior information in the training set. In terms of optimizing the structure of the neural network, DF reduces redundant weights in the network, which not only eases over-fitting but also improves training efficiency. From the perspective of feature space, DF optimizes the feature encoder and reduces the volume of the high-level feature space of categories with insufficient samples, thus making it easier to learn a more compact distribution. At the same time, DF, combined with prior information about co-occurrence between categories, reduces the interference between categories and improves the performance of the model.
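The notion of assigning each category its own, smaller subspace can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions: each category is assumed to use the leading dimensions of the shared high-level feature, and the per-category `ratios` are hypothetical inputs standing in for whatever the training-set prior yields. The rule used here to turn a ratio into a subspace size is a placeholder, not the formula of the paper.

```python
import math
import numpy as np

def subspace_masks(ratios, dim):
    """Build one binary mask per category over a dim-dimensional feature.

    ratios: per-category fractions in (0, 1] (hypothetical; assumed to come
            from prior information in the training set)
    Category c keeps the first ceil(ratios[c] * dim) dimensions (assumption).
    """
    masks = np.zeros((len(ratios), dim))
    for c, ratio in enumerate(ratios):
        k = max(1, math.ceil(ratio * dim))  # at least one dimension per category
        masks[c, :k] = 1.0
    return masks

def disentangle(features, masks):
    """Project (T, D) features onto each category's subspace, giving (C, T, D)."""
    return masks[:, None, :] * features[None, :, :]
```

A category with few identifiable samples would receive a small ratio and hence a low-dimensional subspace, which is the mechanism by which the volume of its feature space is reduced.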

Finally, we evaluate our approaches on the dataset of DCASE2018 Task 4 and confirm our conjecture, achieving state-of-the-art results.

References

  • [1] O. Maron and T. Lozano-Pérez, “A framework for multiple-instance learning,” in Advances in neural information processing systems, 1998, pp. 570–576.
  • [2] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial intelligence, vol. 89, no. 1-2, pp. 31–71, 1997.
  • [3] O. Z. Kraus, J. L. Ba, and B. J. Frey, “Classifying and segmenting microscopy images with deep multiple instance learning,” Bioinformatics, vol. 32, no. 12, pp. i52–i59, 2016.
  • [4] D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional multi-class multiple instance learning,” arXiv preprint arXiv:1412.7144, 2014.
  • [5] Z.-H. Zhou and M.-L. Zhang, “Neural networks for multi-instance learning,” in Proceedings of the International Conference on Intelligent Information Technology, Beijing, China, 2002, pp. 455–459.
  • [6] G. Quellec, G. Cazuguel, B. Cochener, and M. Lamard, “Multiple-instance learning for medical image and video analysis,” IEEE reviews in biomedical engineering, vol. 10, pp. 213–234, 2017.
  • [7] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, and E. I.-C. Chang, “Deep learning of feature representation with multiple instance learning for medical image analysis,” in 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2014, pp. 1626–1630.
  • [8] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille, “Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1742–1750.
  • [9] J. Wu, Y. Zhao, J.-Y. Zhu, S. Luo, and Z. Tu, “Milcut: A sweeping line multiple instance learning paradigm for interactive image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 256–263.
  • [10] S.-Y. Tseng, J. Li, Y. Wang, J. Szurley, F. Metze, and S. Das, “Multiple instance deep learning for weakly supervised small-footprint audio event detection,” arXiv preprint arXiv:1712.09673, 2017.
  • [11] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
  • [12] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. Parag Shah, “Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments,” in Workshop on Detection and Classification of Acoustic Scenes and Events, Woking, United Kingdom, Nov. 2018, submitted to DCASE2018 Workshop. [Online]. Available: https://hal.inria.fr/hal-01850270
  • [13] Y. Wang, J. Li, and F. Metze, “Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks,” Proc. Interspeech 2018, pp. 1339–1343, 2018.
  • [14] M. Ilse, J. M. Tomczak, and M. Welling, “Attention-based deep multiple instance learning,” arXiv preprint arXiv:1802.04712, 2018.
  • [15] X. Lu, P. Shen, S. Li, Y. Tsao, and H. Kawai, “Temporal attentive pooling for acoustic event detection,” in Proc. Interspeech, 2018, pp. 1354–1357.
  • [16] B. McFee, J. Salamon, and J. P. Bello, “Adaptive pooling operators for weakly labeled sound event detection,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 11, pp. 2180–2193, 2018.
  • [17] X. Wang, Y. Yan, P. Tang, X. Bai, and W. Liu, “Revisiting multiple instance neural networks,” Pattern Recognition, vol. 74, pp. 15–24, 2018.
  • [18] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
  • [19] Q. Kong, C. Yu, T. Iqbal, Y. Xu, W. Wang, and M. D. Plumbley, “Weakly labelled audioset classification with attention neural networks,” arXiv preprint arXiv:1903.00765, 2019.
  • [20] J. Wu, Y. Yu, C. Huang, and K. Yu, “Deep multiple instance learning for image classification and auto-annotation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3460–3469.
  • [21] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. Parag Shah, “Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments,” July 2018, submitted to DCASE2018 Workshop. [Online]. Available: https://hal.inria.fr/hal-01850270
  • [22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [23] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, K. Huff and J. Bergstra, Eds., 2015, pp. 18–24.
  • [24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [25] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016. [Online]. Available: http://www.mdpi.com/2076-3417/6/6/162
  • [26] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in neural information processing systems, 2017, pp. 1195–1204.

Liwei Lin received the B.Sc. degree in Computer Science from China Agricultural University, Beijing, China, in 2017. She is currently pursuing an M.E. degree in Computer Science at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. Her research interests include audio signal processing and machine learning.

Xiangdong Wang is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He received his Doctor’s degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2007. His research interests include human-computer interaction, speech recognition, and audio processing.

Hong Liu is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. She received her Doctor’s degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2007. Her research interests include human-computer interaction, multimedia technology, and video processing.

Yueliang Qian is a professor at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He received his Bachelor’s degree in Computer Science from Fudan University, Shanghai, China, in 1983. His research interests include human-computer interaction and pervasive computing.