Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection
Sound event detection (SED) aims to recognize the presence of sound events in a segment of audio and to detect their onsets and offsets. SED can be regarded as a supervised learning task when strong annotations (timestamps) are available during learning. However, because manually producing strongly labeled data is costly, it becomes crucial to introduce weakly supervised learning to SED, in which only weak annotations (clip-level annotations without timestamps) are available during learning.
In this paper, we approach SED as a multiple instance learning (MIL) problem and utilize a neural network framework with an embedding-level pooling module to solve it. The pooling module, which aggregates a sequence of high-level features generated by the neural network feature encoder into a single contextual feature representation, enables the model to learn with only weak annotations. We explore the self-learning ability of different pooling modules on finer information and propose a specialized decision surface (SDS) for the class-wise attention pooling (cATP) module. We analyze and explain why a cATP module with SDS is better than other typical pooling modules from the perspective of feature space. Because several categories often co-occur in the multi-label classification task, we also propose a disentangled feature (DF) to reduce interference between categories, which optimizes the high-level feature space by disentangling it into multiple different subspaces based on class-wise identifiable information in the training set. Experiments show that our approach achieves state-of-the-art performance on Task 4 of the DCASE 2018 challenge.
Sound event detection (SED) is the task of detecting and recognizing individual sound sources in realistic soundscapes. It requires recognizing not only the presence of each event category in a sound source but also the start and end boundaries of each event occurrence. Annotations with such timestamps for all occurrences are termed strong annotations, while weak annotations only indicate the presence of event categories. Due to the difficulty of obtaining large-scale strongly annotated training data, weakly supervised learning with only weakly annotated training data has become a new focus in research on SED.
Weakly supervised learning is often approached as a MIL problem. The excellent performance of neural networks in various fields has promoted the combination of the MIL framework and neural networks for weakly supervised learning. This combination is especially common in medical imaging and semantic segmentation. Tseng et al. proposed a small-footprint MIL framework for multi-class audio event detection (AED), which treats segments in an audio clip as a bag of instances and utilizes global max pooling (GMP) to integrate them. In addition to GMP, global average pooling (GAP), noisy-or pooling and attention pooling are also in common use.
The MIL approach with neural networks for SED comprises three stages:
1) Encode audio features into a high-level representation (or frame-level probabilities) by means of neural networks.
2) Aggregate all the high-level representations (or the frame-level probabilities) into a contextual representation (or a clip-level probability) via a pooling module.
3) Pass the contextual representation into a classifier to obtain the clip-level prediction (or, when frame-level probabilities are used in stage 2, take the clip-level probability directly as the clip-level prediction).
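The three stages can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the encoder is a hypothetical random linear map followed by tanh, the pooling is plain averaging, and the names `encode`, `pool` and `classify` as well as all sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, weight):
    """Stage 1: map raw frame features (T, F) to high-level features (T, D)."""
    return np.tanh(frames @ weight)

def pool(features):
    """Stage 2: aggregate the T frame features into one contextual vector (D,)."""
    return features.mean(axis=0)

def classify(context, weight, bias):
    """Stage 3: sigmoid classifier giving one clip-level probability per category."""
    return 1.0 / (1.0 + np.exp(-(context @ weight + bias)))

T, F, D, C = 500, 64, 16, 10                 # frames, input dims, feature dims, categories
frames = rng.normal(size=(T, F))             # stand-in for the audio features
enc_w = rng.normal(size=(F, D)) * 0.1        # hypothetical encoder weights
cls_w, cls_b = rng.normal(size=(D, C)), np.zeros(C)

features = encode(frames, enc_w)             # (T, D)
context = pool(features)                     # (D,)
clip_prob = classify(context, cls_w, cls_b)  # (C,)
```

Only `clip_prob` is compared with the weak label during training, which is why the choice of the pooling step determines what the frame-level features can learn.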
The approaches are distinguished as instance-level approaches and embedding-level approaches according to whether the frame-level probabilities or the high-level representation are aggregated in the second stage above. The instance-level approaches apply the pooling layer to the frame-level probabilities to obtain the clip-level probability directly. The embedding-level approaches apply the pooling layer to the high-level representation generated by neural networks to obtain a contextual representation, from which the clip-level probability is then computed. As reported in prior work, the embedding-level approaches are preferable in terms of clip-level performance, which has been demonstrated on Audioset.
I-A Our contributions
In this paper, we approach SED as a multiple instance learning (MIL) problem and utilize a neural network framework with an embedding-level class-wise attention pooling (cATP) module to solve it. We explain why a pooling module is able to learn a potential instance-level decision surface while it learns the bag-level decision surface from the high-level feature space. In other pooling modules such as GMP and GAP, frame-level (instance-level) features and clip-level (bag-level) features share the same decision surface, whereas cATP introduces a separate decision surface for the frame-level features. We term it the specialized decision surface (SDS) and demonstrate that it is more conducive to frame-level feature classification than the shared decision surface mentioned above.
Furthermore, we propose a disentangled feature (DF) to help multi-class learning. Because the data set is unbalanced and multiple categories often occur together at clip level, it is difficult for cATP to learn separate feature subspaces for event categories that overlap heavily with other categories, especially for those with relatively few occurrences. Therefore, by taking the category overlap information into account, we propose a disentangled feature which re-models the high-level feature space so that the feature subspace of each category differs from those of other categories without pre-training. The scales of these disentangled feature subspaces depend on how many clips are available that contain strong class-wise identifiable information with little interference from other categories. By introducing more class-wise prior information and reducing redundant network weights, the disentangled feature can be regarded as a regularization method that helps improve the performance of cATP-MIL frameworks.
Our experiments showed that cATP with SDS and DF outperforms other pooling modules and simple cATP. Detailed analysis of the high-level feature space also supported our hypothesis.
II Related work
II-A Multiple instance learning
According to MIL, all the samples with global annotations are treated as bags of instances. If there is at least one positive instance in a bag, the bag is labeled as a positive bag. Since neural networks have been widely used as general high-level feature extractors in various tasks, applications of MIL based on neural networks typically focus on adding a pooling module to the highest layer of the network, which aggregates the output scores of all the instances into a bag-level score so that the model is able to calculate the loss using only a weak annotation.
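The bag-labeling rule can be written down directly. The sketch below is plain Python with a hypothetical helper `bag_labels`; it applies the "at least one positive instance" rule independently to each category, which is how the rule is usually extended to several classes.

```python
def bag_labels(frame_labels):
    """Derive clip-level (bag) labels from per-frame (instance) labels.

    frame_labels: list of frames, each a list with one 0/1 entry per category.
    A category is positive for the bag if it is positive in at least one frame.
    """
    n_categories = len(frame_labels[0])
    return [int(any(frame[c] for frame in frame_labels)) for c in range(n_categories)]

# Three frames, three categories: category 0 is active in the last frame,
# category 1 never, category 2 in the first and last frames.
frames = [[0, 0, 1], [0, 0, 0], [1, 0, 1]]
clip = bag_labels(frames)
```

In weakly supervised training only `clip` is observed; the per-frame labels are exactly what the model has to infer.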
For multi-class classification, as shown in Figure 1, let $x = \{x_1, x_2, \dots, x_T\}$ be the high-level feature sequence of the audio clip generated by the feature encoder and $y = \{y_1, y_2, \dots, y_C\}$ ($y_c \in \{0, 1\}$) be the ground truths, where $C$ is the number of categories.
Consider the audio clip as a bag with several instances (frames). According to MIL, the audio clip is marked with a positive label when there is at least one positive frame in it. During learning, different pooling modules represent different strategies for associating the annotation of the audio clip with those of its frames.
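Instance-level pooling modules aggregate frame-level predictions into a clip-level prediction:
$$\hat{P}(y_c \mid x) = \mathrm{pool}\big(\hat{P}(y_c \mid x_1), \dots, \hat{P}(y_c \mid x_T)\big)$$
Embedding-level pooling modules aggregate the high-level feature sequence into a contextual representation $h$:
$$h = \mathrm{pool}(x_1, \dots, x_T), \qquad \hat{P}(y_c \mid x) = \hat{P}(y_c \mid h)$$
When making predictions, let $\alpha$ be a threshold for both clip-level and frame-level prediction. Then the clip-level prediction for event category $c$ is:
$$\phi_c(x) = \begin{cases} 1, & \hat{P}(y_c \mid h) \ge \alpha \\ 0, & \text{otherwise} \end{cases}$$
The frame-level prediction for event category $c$ at time $t$ is:
$$\varphi_c(x, t) = \begin{cases} 1, & \hat{P}(y_c \mid x_t) \ge \alpha \\ 0, & \text{otherwise} \end{cases}$$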
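As an illustration of how thresholded frame-level outputs become detected events, the sketch below binarizes a probability sequence and extracts onset and offset frame indices. The threshold value 0.5 and the helper names are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def binarize(prob, alpha=0.5):
    """Threshold probabilities; alpha=0.5 is an illustrative choice."""
    return (np.asarray(prob) >= alpha).astype(int)

def frames_to_events(frame_pred):
    """Return (onset_frame, offset_frame) pairs for runs of 1s; offset is exclusive."""
    padded = np.concatenate(([0], frame_pred, [0]))
    change = np.diff(padded)
    onsets = np.flatnonzero(change == 1)     # 0 -> 1 transitions
    offsets = np.flatnonzero(change == -1)   # 1 -> 0 transitions
    return list(zip(onsets.tolist(), offsets.tolist()))

frame_prob = np.array([0.1, 0.7, 0.9, 0.4, 0.2, 0.8, 0.6, 0.3])
frame_pred = binarize(frame_prob)            # [0 1 1 0 0 1 1 0]
events = frames_to_events(frame_pred)        # [(1, 3), (5, 7)]
```

Multiplying the frame indices by the hop duration converts them into onset and offset times.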
As discussed in Section I, embedding-level MIL is preferable in terms of clip-level performance, so we only discuss embedding-level MIL in the remaining sections.
II-B Pooling modules
In this section, we introduce different typical pooling modules including global max pooling (GMP), global average pooling (GAP) and global softmax pooling (GSP) for the embedding-level MIL.
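For GMP and GAP, assuming the contextual representation $h$ is a $d$-dimensional vector, the $i$-th component of $h$ for GMP is
$$h_i = \max_{1 \le t \le T} x_{t,i}$$
and the $i$-th component of $h$ for GAP is
$$h_i = \frac{1}{T} \sum_{t=1}^{T} x_{t,i}$$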
Both GMP and GAP generate a contextual representation that takes the information of the whole sequence into account. However, neither of them accounts for the different contributions of each frame-level feature $x_t$ to the contextual representation at different times and for different categories.
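Both operators reduce a (frames x dimensions) feature matrix along the time axis. A minimal NumPy sketch on a toy matrix (the function names and values are illustrative):

```python
import numpy as np

def global_max_pooling(x):
    """x: (T, d) feature sequence -> (d,) vector of per-dimension maxima over time."""
    return x.max(axis=0)

def global_average_pooling(x):
    """x: (T, d) feature sequence -> (d,) vector of per-dimension means over time."""
    return x.mean(axis=0)

x = np.array([[0.0, 2.0],
              [4.0, -1.0],
              [2.0, 5.0]])                 # T = 3 frames, d = 2 dimensions
h_max = global_max_pooling(x)              # [4., 5.]
h_avg = global_average_pooling(x)          # [2., 2.]
```

In both cases the same vector is produced for every category, which is the limitation described above.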
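GSP fixes this defect by generating a different contextual representation $h_c$ for each category $c$:
$$h_c = \sum_{t=1}^{T} a_{ct} \, x_t, \qquad a_{ct} = \frac{\exp\big(g(\hat{P}(y_c \mid x_t))\big)}{\sum_{k=1}^{T} \exp\big(g(\hat{P}(y_c \mid x_k))\big)}$$
where $g$ is a function that scales $\hat{P}(y_c \mid x_t)$ appropriately.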
Therefore, GSP attempts to separate clip-level embeddings of different categories to different subspaces, enabling the model to be more flexible in self-learning.
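A minimal NumPy sketch of softmax-weighted pooling follows. It assumes that the weights are a softmax over time of the scaled frame-level probabilities produced by the shared classifier; the linear scaling `scale * p`, the toy sizes and all names are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_over_time(z):
    """Numerically stable softmax along axis 0 (the time axis)."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def global_softmax_pooling(x, cls_w, cls_b, scale=1.0):
    """x: (T, D); cls_w: (D, C); cls_b: (C,). Returns h (C, D) and weights (T, C)."""
    frame_prob = sigmoid(x @ cls_w + cls_b)          # shared classifier applied per frame
    weights = softmax_over_time(scale * frame_prob)  # one weight per frame and category
    return weights.T @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                          # T = 6 frames, D = 4 dims
cls_w, cls_b = rng.normal(size=(4, 3)), np.zeros(3)  # C = 3 categories
h, weights = global_softmax_pooling(x, cls_w, cls_b, scale=5.0)
h_uniform, _ = global_softmax_pooling(x, cls_w, cls_b, scale=0.0)
```

With `scale=0.0` every frame receives the same weight and the result coincides with GAP; larger scales concentrate the weight on the most confident frames, moving the behavior toward GMP.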
III-A Specialized decision surface
According to Equation 5, all the pooling modules introduced in Section II-B assume by default that the model not only learns the decision surface for the contextual representation $h$ explicitly but also learns a potential decision surface for the frame-level features $x_t$, and that this potential decision surface gradually approaches the decision surface of $h$.
As shown in Figure 2, when $h$ and $x_t$ are depicted in the same feature space, we can intuitively show how the pooling module performs weakly supervised learning. For example, GMP tends to update the frames at the boundaries of the sequence $x$, while GAP tends to move every $x_t$ in a positive clip toward the positive decision surface, ignoring the mistakes caused by negative frames in the clip. The mistakes made by GAP can be eased when the $x_t$ in a negative clip are updated. Similarly, according to Equation 9, when updating $x_t$, GSP tends to move the frames that fall on (or near) one side of the decision surface in the same direction. This is a tradeoff between bulk updates and fewer mistakes, that is, a tradeoff between GAP and GMP, for which GSP is expected to perform better than both. At the same time, the feature space formed by these update strategies finally makes $h$ and $x_t$ share the same decision surface, although this shared decision surface seems somewhat mismatched when it comes to the classification of the latter.
On the basis of these observations, we propose an explanation for the superiority of the class-wise attention pooling (cATP) module.
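Similarly to GSP, cATP also employs weighting factors $a_{ct}$ to distinguish how important $x_t$ is to $h_c$ for each category $c$, but the weighting factor depends on a trainable vector $w_c$ and a bias $b_c$ instead of $\hat{P}(y_c \mid x_t)$:
$$h_c = \sum_{t=1}^{T} a_{ct} \, x_t, \qquad a_{ct} = \frac{\exp\big((w_c^{\top} x_t + b_c) / d\big)}{\sum_{k=1}^{T} \exp\big((w_c^{\top} x_k + b_c) / d\big)}$$
where $d$ is a scaling factor that prevents the dot products from growing so large in magnitude that they push the softmax function into regions where it has extremely small gradients.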
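The attention weights can be sketched in NumPy as a softmax over time of a per-category linear score. This is an illustration under assumptions: the score is taken to be a dot product with a trainable vector plus a bias, divided by a scaling factor `d`, and the toy sizes and function name are hypothetical.

```python
import numpy as np

def class_wise_attention_pooling(x, att_w, att_b, d=1.0):
    """x: (T, D); att_w: (D, C); att_b: (C,). Returns h (C, D) and weights (T, C)."""
    logits = (x @ att_w + att_b) / d                       # scaled per-category scores
    logits = logits - logits.max(axis=0, keepdims=True)    # stabilize the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True) # normalize over time
    return weights.T @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                                # T = 6 frames, D = 4 dims
att_w, att_b = rng.normal(size=(4, 3)), np.zeros(3)        # C = 3 categories
h, weights = class_wise_attention_pooling(x, att_w, att_b, d=x.shape[1])
h_flat, _ = class_wise_attention_pooling(x, np.zeros((4, 3)), att_b)
```

Unlike the softmax pooling above, the weights here do not pass through the clip-level classifier, so the attention parameters are free to form their own boundary in the frame-level feature space.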
We argue that the free parameters $w_c$ and $b_c$ determine a specialized decision surface (SDS) that selects important frames to update. The frame-level features $x_t$ are grouped into two clusters in their feature space, and SDS is exactly the decision surface formed in this unsupervised process.
As shown in Figure 2(d), compared with the shared decision surface of GSP, the SDS of cATP allows a more flexible selection of positive (or negative) frames. As a result, fewer mistakes are made when updating, leading to better classification performance.
We also note that SDS is not only better at determining the importance of frames but also more suitable as the decision surface of $x_t$ than the shared decision surface used in MIL.
From this point on, the frame-level prediction for event category $c$ at time $t$ is based on:
$$\hat{P}(y_c \mid x_t) = \sigma(w_c^{\top} x_t + b_c)$$
where $\sigma$ is the sigmoid function.
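Under the reading that the attention parameters themselves define the frame-level decision surface, the frame-level probability is a sigmoid of the attention score. The sketch below uses that assumption with made-up parameter values and a 0.5 threshold chosen only for illustration.

```python
import numpy as np

def sds_frame_probability(x, att_w, att_b):
    """x: (T, D); att_w: (D, C); att_b: (C,). Sigmoid of the attention score per frame."""
    return 1.0 / (1.0 + np.exp(-(x @ att_w + att_b)))

x = np.array([[1.0, 0.0],
              [-1.0, 0.0],
              [0.2, 3.0]])                  # T = 3 frames, D = 2 dims
att_w = np.array([[2.0], [0.0]])            # one category; surface is the line x[0] = 0
att_b = np.array([0.0])

prob = sds_frame_probability(x, att_w, att_b)
pred = (prob >= 0.5).astype(int)            # frames on the positive side of the surface
```

Frames are classified by which side of the hyperplane defined by `att_w` and `att_b` they fall on, independently of the clip-level classifier.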
III-B Disentangled feature
cATP with SDS explicitly separates $h$ and $x_t$ into two subspaces: the feature distribution and decision surface are learned with supervision in the former subspace and without supervision in the latter. In multi-label classification, the fact that a certain category frequently co-occurs with other categories makes it difficult to differentiate the features of this category in the former subspace from those of other categories. This effect is exacerbated when the number of clips carrying much identifiable information about certain categories is particularly small in the unbalanced set.
To mitigate this effect, we propose a disentangled feature (DF) that re-models multiple feature subspaces for $h$ by selecting specific bases for the feature space of each event category. Since $h$ is produced from $x$ according to Equation 8, the feature space of $x$ is also re-modeled into multiple feature subspaces.
Assume that $\chi^d$ ($d \in \mathbb{N}$) is the $d$-dimensional space generated by the feature encoder and $B = \{e_1, e_2, \dots, e_d\}$ is a basis of $\chi^d$. We define $\chi_c$, a subspace of $\chi^d$, as the feature space of event category $c$; the basis of $\chi_c$ is then
$$B_c = \{e_1, e_2, \dots, e_{k_c}\}$$
where $k_c$ depends on how many clips with little interference are available for category $c$ during training.
In this way, the diversity of the $k_c$ leads the feature space of each category to be re-modeled into a disentangled feature space that differs from those of the other categories. The larger the absolute difference between the $k_c$ of two categories, the more their feature spaces differ. The difference between feature spaces results in diverse decision surfaces among categories without pre-training. In the extreme case with $k_1 = k_2 = \dots = k_C = d$, all the subspaces are equal to $\chi^d$, so the disentangled feature degenerates to the general feature.
Meanwhile, we argue that for category $c$, the larger the proportion of clips containing little interference from other event categories, the more class-wise identifiable information needs to be learned, which requires a larger volume of the feature space. Conversely, the smaller the proportion of these clips, the smaller the volume of the feature space required to prevent overfitting. For this reason, $k_c$ increases as the proportion of these clips of category $c$ increases.
Considering that a too-small $k_c$ severely cuts into the ability of the model to recognize category $c$, we utilize a constant factor $m$ to tackle this effect:
$$k_c = \lceil ((1 - m) \cdot f_c + m) \cdot d \rceil$$
where $f_c$ relates to the number of clips containing little interference in the training set. As $m$ increases to 1, the disentangled feature degrades into the general feature. In our experiments, $m$ is set to a fixed constant.
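One way to read the role of the constant factor is as a floor on the subspace size. The sketch below assumes an affine interpolation between a minimum fraction `m` of the full dimension and the full dimension itself, with the class factor normalized to [0, 1]; this form, the dimension 160 and the value 0.25 are illustrative assumptions chosen to match the stated limiting behavior, not values from the paper.

```python
import math

def subspace_size(f_c, d, m):
    """Assumed form: k_c = ceil(((1 - m) * f_c + m) * d), clamped to [1, d].

    f_c: class factor in [0, 1]; d: full feature dimension; m: constant floor in [0, 1].
    """
    if not (0.0 <= f_c <= 1.0 and 0.0 <= m <= 1.0):
        raise ValueError("f_c and m must lie in [0, 1]")
    k = math.ceil(((1.0 - m) * f_c + m) * d)
    return max(1, min(d, k))

# Hypothetical setting: d = 160 dimensions, floor m = 0.25.
sizes = [subspace_size(f, d=160, m=0.25) for f in (0.0, 0.5, 1.0)]   # [40, 100, 160]
full = subspace_size(0.3, d=160, m=1.0)                              # 160
```

With `m = 1.0` every category receives the full space, which corresponds to the degenerate case in which the disentangled feature equals the general feature.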
We quantify the level of interference according to the principle that the more categories a clip covers, the more interference the other categories cause to any one of them:
$$f_c \propto \sum_{i} r_i \, N_{ci}$$
Here, $N_{ci}$ denotes the number of clips in the training set that contain $i$ categories including category $c$, and $r_i$ is the corresponding constant coefficient expressing the importance of these clips. If $r_i = 1$ for all $i$, $f_c$ only relates to the number of clips containing category $c$ in the training set. We argue that the less interference the other categories cause in a clip, the more important the clip is, so $r_i$ can instead be chosen to decrease as $i$ grows. We can also consider only the clips containing the least interference:
$$r_i = \begin{cases} 1, & i = 1 \\ 0, & \text{otherwise} \end{cases}$$
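The three weighting schemes can be compared on toy counts. In the sketch below the clip counts are invented, the decreasing coefficient is assumed to be `1 / i`, and the class factor is normalized by the largest score; all three choices are assumptions made for illustration, and only the uniform and single-category schemes follow directly from the description above.

```python
def weighted_clip_count(counts, coeff):
    """counts[i - 1] = number of clips with i categories that include the category."""
    return sum(coeff(i) * n for i, n in enumerate(counts, start=1))

def normalize(scores):
    """Scale scores so that the largest one equals 1 (assumed normalization)."""
    top = max(scores.values())
    return {name: s / top for name, s in scores.items()}

coefficients = {
    "DFN": lambda i: 1.0,                      # every clip counts equally
    "DFW": lambda i: 1.0 / i,                  # assumed decreasing weight
    "DF1": lambda i: 1.0 if i == 1 else 0.0,   # only single-category clips count
}

# Made-up counts of clips containing 1, 2 and 3 categories.
counts = {"Speech": [100, 60, 40], "Frying": [10, 20, 6]}

f = {
    scheme: normalize({c: weighted_clip_count(n, r) for c, n in counts.items()})
    for scheme, r in coefficients.items()
}
```

In this toy example the rare category's factor shrinks from 0.18 under uniform weighting to 0.1 when only single-category clips are counted, so its subspace would become smaller.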
To simplify training, we take an orthogonal basis $B$ in which the element of $e_i$ in dimension $i$ is 1 for $i = 1, \dots, d$ and all other elements are 0. The $k_c$ basis vectors of $B_c$ then correspond to $k_c$ dimensions of $x_t$. As shown in Figure 3, we easily obtain a ladder-shaped group of disentangled feature maps from the feature encoder for a clip.
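With a standard basis, selecting a subspace amounts to keeping a subset of feature dimensions. The sketch below assumes each category keeps the first few dimensions (which yields the ladder shape); the sizes and the function name are hypothetical.

```python
import numpy as np

def disentangle(x, sizes):
    """x: (T, d) features; sizes: subspace size per category. Returns (C, T, d).

    Category c keeps the first sizes[c] dimensions; the rest are zeroed out.
    """
    d = x.shape[1]
    mask = np.arange(d)[None, :] < np.asarray(sizes)[:, None]   # (C, d) ladder-shaped mask
    return x[None, :, :] * mask[:, None, :]

x = np.arange(1.0, 13.0).reshape(3, 4)     # T = 3 frames, d = 4 dims
sizes = [1, 2, 4]                          # hypothetical subspace sizes for 3 categories
maps = disentangle(x, sizes)
```

The last category, whose size equals `d`, sees the unmodified features, while the others see progressively fewer dimensions.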
Combining disentangled feature and cATP to generate the contextual representation of event category , we have
In this section, we introduce the dataset and describe in detail the model architecture, the pre-processing and post-processing methods, the training configuration, and the evaluation measure in our experiments.
We utilize the dataset from Task 4 of the DCASE 2018 Challenge, which is a subset of Audioset by Google. The set contains 1578 weakly labeled clips (2244 class occurrences) for which weak annotations have been verified and cross-checked, 14412 unlabeled in-domain clips, 39999 unlabeled out-of-domain clips and 1168 clips with strong annotations. The challenge divides the strongly labeled clips into two subsets: a validation set (288 clips) and an evaluation set (880 clips). In our experiments, we used the weakly labeled data to pre-train a clip-level classification model, used it to tag the unlabeled in-domain data with weak annotations, and discarded 1001 clips with empty annotations. Consequently, the training set in our experiments comprises 14989 clips with noisy weak annotations, characterized by large scale and an unbalanced distribution, as shown in Figure 4.
IV-B Model architecture
As shown in Figure 5, the model architecture employed in our experiments comprises three modules: the feature encoder, the pooling module, and the classifier. The feature encoder consists of 3 convolutional blocks, each of which comprises a convolutional layer, a batch normalization layer, a max pooling layer (no temporal pooling), and an activation layer. The pooling modules GAP, GMP and GSP are described in detail in Section II-B, and cATP in Section III-A. We utilize a convolutional layer with a sigmoid activation function as the classifier.
As for cATP-SDS-DF, we experimented with three different methods of determining the constant coefficient discussed in Section III-B: cATP-SDS-DFN (Equation 17), cATP-SDS-DFW (Equation 18) and cATP-SDS-DF1 (Equation 19). Figure 7 illustrates the condition of cATP-SDS-DF1. The disentangled dimension of each category under these three methods is shown in Table I.
Table I (columns: DF dimension, window size)
IV-C Pre-processing and post-processing
The input to the feature encoder consists of 64 log mel-band magnitudes extracted from 40 ms frames with overlap using the librosa package. Each 10-second audio clip is converted into a sequence of 500 feature frames. A fixed threshold on the predicted probability determines whether an event category exists in a clip. For frame-level prediction, all the probabilities are smoothed by a median filter with a group of adaptive window sizes. The smoothing operation is then repeated on the final frame-level prediction.
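The stated frame length, clip length and frame count together pin down the hop size. The arithmetic below assumes the frame count is simply the clip duration divided by the hop (edge padding effects ignored).

```python
clip_s = 10.0        # clip duration in seconds
frame_s = 0.040      # analysis frame length in seconds
n_frames = 500       # frames per clip

hop_s = clip_s / n_frames            # 0.02 s between successive frames
overlap = 1.0 - hop_s / frame_s      # 0.5, i.e. half of each frame is shared with the next
```

Under that assumption, 500 frames per 10-second clip corresponds to a 20 ms hop, so consecutive 40 ms frames overlap by half.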
The adaptive window size of the median filter for category $c$ is:
$$S_c = \beta \cdot \mathrm{duration}_c$$
where $\mathrm{duration}_c$ is the average duration of category $c$ in the training set and $\beta$ is a constant. The specific window sizes are shown in Table I.
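A median filter with a class-dependent window can be sketched with NumPy alone. The sketch assumes the window is proportional to the average event duration measured in frames and is forced to be odd; the proportionality constant, the rounding rule and the helper names are illustrative assumptions.

```python
import numpy as np

def adaptive_window(avg_duration_frames, beta):
    """Window proportional to the average event duration, forced to an odd size >= 1."""
    size = max(1, int(round(beta * avg_duration_frames)))
    return size | 1

def median_smooth(prob, window):
    """Median-filter a 1-D sequence with edge padding; window must be odd."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    pad = window // 2
    padded = np.pad(prob, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=1)

frame_pred = np.array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1], dtype=float)
smoothed = median_smooth(frame_pred, window=3)
```

On this toy sequence the isolated positive frame at index 2 is removed and the one-frame gap at index 8 is filled, which is the effect that makes the window size matter for event boundaries.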
IV-D Training and evaluation
The neural networks are trained using the Adam optimizer with mini-batches of 10-second patches. The learning rate is decayed at fixed epoch intervals. We take binary cross entropy as the loss function. Training stops if the clip-level macro performance on the validation set does not improve within a fixed number of epochs. The best-performing model on the validation set is retained for prediction. All the experiments are repeated several times under the same parameter configuration, and we take the average of all the results as the final result. In particular, to compare with the first-place system in the challenge, we additionally report the best results among these experiments. Event-based measures with a 200 ms collar on onsets and a collar on offsets equal to the larger of 200 ms and a fixed fraction of the event length are calculated over the entire test set. The implementation of our methods is available online at https://github.com/Kikyo-16/Sound_event_detection.
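The collar rule used by event-based measures can be sketched as follows. The 200 ms collars come from the text; the offset fraction is a parameter whose default of 0.2 is a hypothetical placeholder, and the function is an illustration of the matching rule rather than the evaluation toolkit's code.

```python
def event_matches(ref, est, onset_collar=0.2, offset_collar=0.2, offset_ratio=0.2):
    """ref, est: (onset_s, offset_s) of events with the same label.

    The onset must lie within onset_collar of the reference onset; the offset
    within max(offset_collar, offset_ratio * reference event length).
    offset_ratio=0.2 is a placeholder default, not a value from the text.
    """
    ref_on, ref_off = ref
    est_on, est_off = est
    onset_ok = abs(est_on - ref_on) <= onset_collar
    tolerance = max(offset_collar, offset_ratio * (ref_off - ref_on))
    offset_ok = abs(est_off - ref_off) <= tolerance
    return onset_ok and offset_ok

long_ok = event_matches((1.0, 6.0), (1.1, 6.9))      # long event: offset tolerance is 1.0 s
late_onset = event_matches((1.0, 6.0), (1.5, 6.0))   # onset is 0.5 s late
short_bad = event_matches((1.0, 1.5), (1.0, 1.8))    # short event: tolerance stays at 0.2 s
```

Long events tolerate larger offset errors than short ones, which is why smoothing that slightly shifts boundaries hurts short events most.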
Table II (columns: event detection (frame-level), audio tagging (clip-level))
Table III (columns: event detection, audio tagging)
In this section, we report the results of our experiments and analyze in detail the distribution of test set data in the high-level feature space of models to prove our conjecture.
As shown in Table II, cATP-SDS-DF1 achieves the best frame-level score among all the models. The best run of cATP-SDS-DF1, shown in Table III, improves on the first-place system in the challenge. cATP-SDS-DFW achieves the best clip-level score among all the models. We illustrate the results of all 20 experiments in Figure 7. Since the window size of the median filter in post-processing has a great impact on the results, we show the performance of all models both when the window size is fixed at 27 and when adaptive window sizes with different settings are employed.
V-B The performances of different pooling modules
Comparing the performance of cATP with that of the other pooling modules in Table II, cATP is dominant in both event detection and audio tagging. We note that among GMP, GAP, and GSP, GMP performs best on audio tagging but worst on event detection. This is because GMP makes a clip-level decision depending mainly on the boundaries of the high-level feature sequence, which guarantees a more reliable clip-level prediction. However, this updating strategy also leads to less reliable predictions for frames distributed far from the boundaries of the sequence cluster. GAP raises performance on event detection by updating all the frames in a clip. But when making a clip-level prediction, negative frames in a positive clip interfere to a large extent with the model's ability to make a correct decision. For example, as shown in Figure 8, GMP ignores the frames far from the boundaries of the sequence cluster and makes extremely discontinuous predictions for both “Blender” and “Speech”, while negative frames mislead GAP and GSP into making a false prediction for the category “Blender”. cATP achieves a better tradeoff between these two conditions.
V-C The effect of SDS on event detection boundary
When we take SDS as the decision surface for event detection, the frame-level performance of cATP improves. To explain this phenomenon, we project the high-level feature sequences generated by the feature encoder into a two-dimensional space using principal component analysis (PCA) for observation.
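A two-dimensional PCA projection of frame-level features can be computed with a singular value decomposition. The sketch below uses random stand-in features; the sizes and the function name are illustrative.

```python
import numpy as np

def pca_2d(features):
    """Project (N, D) features onto their first two principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Stand-in for high-level frame features: 200 frames, 16 dims with unequal variance.
features = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)
points = pca_2d(features)                  # (200, 2) coordinates for a scatter plot
```

Coloring each projected point by its predicted frame label gives the kind of scatter plot used to compare decision surfaces.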
As shown in Figure 9, to highlight the frame-level decision surface, we only draw the frames in clips that are predicted to be positive. Yellow points represent frames predicted to be positive and purple points represent frames predicted to be negative. We can see that SDS clearly matches the potential decision surface, which forms without supervision and divides the frames into two clusters. The shared decision surface, however, fails to do so, which leads to the weak event detection performance of models with GMP, GAP, GSP and cATP. This observation matches what we expected in Section III-A. Figure 10 illustrates the advantage of SDS on event boundary detection.
V-D The effect of DF on multi-class classification
As shown in Figure 4, “Dishes” and “Frying” frequently co-occur with each other and account for a relatively small proportion of the training set, while “Speech” frequently co-occurs with every other category. “Running water” also frequently co-occurs with “Dishes”. When we focus on these four categories, we find that cATP-SDS-DF1 and cATP-SDS-DFW do improve the class-wise performance of “Dishes”, “Frying” and “Running water” on both event detection and audio tagging, as shown in Figure 11 and Figure 12.
We argue that the high rate of co-occurrence between these categories and the small number of samples imply less identifiable information for each category, which makes it more difficult to learn good contextual representations in the high-level feature space.
However, when we prepare a specific subspace for each category using DF, the volume of these subspaces is greatly reduced, making it easier to fit a small amount of data containing identifiable information. As shown in Figure 13, these subspaces can be easily distinguished from each other without pre-training, which strengthens the model's resistance to interference between categories. As shown in Figure 14, the clusters of the contextual representations with DF are more compact than those without, especially for “Frying”.
In addition, the weaker performance of cATP-SDS-DFN which concentrates more on the scale of data samples also demonstrates the power of the prior information about co-occurrence in the unbalanced dataset.
In this paper, we introduce a specialized decision surface (SDS) and a disentangled feature (DF) for weakly-supervised polyphonic sound event detection. Firstly, we approach the task as a MIL problem and introduce a MIL framework with neural networks and a pooling module. This framework is common in weakly-supervised tasks, and to give the reader a sense of how it works on SED tasks, we group it into two broad categories: MIL with instance-level pooling modules and MIL with embedding-level pooling modules. Since embedding-level MIL is preferable in terms of clip-level performance, we explore how different pooling modules work, based on which we are able to explain the superiority of the class-wise attention pooling module. The exploration of the high-level feature space generated by the neural network feature encoder leads to the discovery of a potential decision surface formed without supervision, which we term the specialized decision surface (SDS). This decision surface explains the power of the class-wise attention pooling module and provides a better decision surface for event detection than the conventional shared surface in MIL. Solving MIL problems with an attention pooling module is not new, but to the best of our knowledge, we are the first to explain why attention pooling brings better frame-level (instance-level) prediction from the perspective of the high-level feature space and its potential decision surface. We hope this explanation will help further work to improve the potential decision surface in attention pooling modules for weakly-supervised learning.
Secondly, to tackle the common problems caused by category co-occurrence and data imbalance in the multi-label task, we propose a disentangled feature, which determines specific subspaces for different categories without pre-training according to the prior information in the training set. In terms of optimizing the structure of the neural network, DF reduces redundant weights in the network, which not only eases over-fitting but also improves training efficiency. From the perspective of feature space, DF optimizes the feature encoder and reduces the volume of the high-level feature space of categories with insufficient samples, thus making it easier to learn a more compact distribution. At the same time, DF, combined with prior information about co-occurrence between categories, reduces the interference between categories and improves the performance of the model.
Finally, we evaluate our approaches on the dataset of DCASE 2018 Task 4 and confirm our conjecture, reaching state-of-the-art results.
-  O. Maron and T. Lozano-Pérez, “A framework for multiple-instance learning,” in Advances in neural information processing systems, 1998, pp. 570–576.
-  T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial intelligence, vol. 89, no. 1-2, pp. 31–71, 1997.
-  O. Z. Kraus, J. L. Ba, and B. J. Frey, “Classifying and segmenting microscopy images with deep multiple instance learning,” Bioinformatics, vol. 32, no. 12, pp. i52–i59, 2016.
-  D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional multi-class multiple instance learning,” arXiv preprint arXiv:1412.7144, 2014.
-  Z.-H. Zhou and M.-L. Zhang, “Neural networks for multi-instance learning,” in Proceedings of the International Conference on Intelligent Information Technology, Beijing, China, 2002, pp. 455–459.
-  G. Quellec, G. Cazuguel, B. Cochener, and M. Lamard, “Multiple-instance learning for medical image and video analysis,” IEEE reviews in biomedical engineering, vol. 10, pp. 213–234, 2017.
-  Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, I. Eric, and C. Chang, “Deep learning of feature representation with multiple instance learning for medical image analysis,” in 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2014, pp. 1626–1630.
-  G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille, “Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1742–1750.
-  J. Wu, Y. Zhao, J.-Y. Zhu, S. Luo, and Z. Tu, “Milcut: A sweeping line multiple instance learning paradigm for interactive image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 256–263.
-  S.-Y. Tseng, J. Li, Y. Wang, J. Szurley, F. Metze, and S. Das, “Multiple instance deep learning for weakly supervised small-footprint audio event detection,” arXiv preprint arXiv:1712.09673, 2017.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929.
-  R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. Parag Shah, “Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments,” in Workshop on Detection and Classification of Acoustic Scenes and Events, Woking, United Kingdom, Nov. 2018, submitted to DCASE2018 Workshop. [Online]. Available: https://hal.inria.fr/hal-01850270
-  Y. Wang, J. Li, and F. Metze, “Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks,” Proc. Interspeech 2018, pp. 1339–1343, 2018.
-  M. Ilse, J. M. Tomczak, and M. Welling, “Attention-based deep multiple instance learning,” arXiv preprint arXiv:1802.04712, 2018.
-  X. Lu, P. Shen, S. Li, Y. Tsao, and H. Kawai, “Temporal attentive pooling for acoustic event detection,” in Proc. Interspeech, 2018, pp. 1354–1357.
-  B. McFee, J. Salamon, and J. P. Bello, “Adaptive pooling operators for weakly labeled sound event detection,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 11, pp. 2180–2193, 2018.
-  X. Wang, Y. Yan, P. Tang, X. Bai, and W. Liu, “Revisiting multiple instance neural networks,” Pattern Recognition, vol. 74, pp. 15–24, 2018.
-  J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
-  Q. Kong, C. Yu, T. Iqbal, Y. Xu, W. Wang, and M. D. Plumbley, “Weakly labelled audioset classification with attention neural networks,” arXiv preprint arXiv:1903.00765, 2019.
-  J. Wu, Y. Yu, C. Huang, and K. Yu, “Deep multiple instance learning for image classification and auto-annotation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3460–3469.
-  R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. Parag Shah, “Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments,” July 2018, submitted to DCASE2018 Workshop. [Online]. Available: https://hal.inria.fr/hal-01850270
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, K. Huff and J. Bergstra, Eds., 2015, pp. 18–24.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016. [Online]. Available: http://www.mdpi.com/2076-3417/6/6/162
-  A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in neural information processing systems, 2017, pp. 1195–1204.
Liwei Lin received the B.Sc. degree in Computer Science from China Agricultural University, Beijing, China, in 2017. She is currently pursuing an M.E. degree in Computer Science at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. Her research interests include audio signal processing and machine learning.
Xiangdong Wang is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He received his doctoral degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2007. His research fields include human-computer interaction, speech recognition and audio processing.
Hong Liu is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. She received her doctoral degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2007. Her research fields include human-computer interaction, multimedia technology, and video processing.
Yueliang Qian is a professor at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He received his Bachelor's degree in Computer Science from Fudan University, Shanghai, China, in 1983. His research fields include human-computer interaction and pervasive computing.