Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection

  • 2019-07-10 16:41:13
  • Liwei Lin, Xiangdong Wang, Hong Liu, Yueliang Qian

Abstract

Sound event detection (SED) aims to recognize the presence of sound events in a segment of audio and to detect their onsets and offsets. SED can be regarded as a supervised learning task when strong annotations (timestamps) are available during learning. However, because manually producing strongly labeled data is costly, it becomes crucial to introduce weakly supervised learning to SED, in which only weak annotations (clip-level annotations without timestamps) are available during learning. In this paper, we approach SED as a multiple instance learning (MIL) problem and use a neural network framework with an embedding-level pooling module to solve it. The pooling module, which aggregates a sequence of high-level features generated by the neural network feature encoder into a single contextual feature representation, enables the model to learn with only weak annotations. We explore the ability of different pooling modules to learn finer-grained information on their own and propose a specialized decision surface (SDS) for the class-wise attention pooling (cATP) module. We analyze and explain, from the perspective of feature space, why a cATP module with SDS is better than other typical pooling modules. Because several categories co-occur in the multi-label classification task, we also propose a disentangled feature (DF) to reduce interference between categories. DF optimizes the high-level feature space by disentangling it, based on class-wise identifiable information in the training set, into multiple different subspaces. Experiments show that our approach achieves state-of-the-art performance on Task 4 of the DCASE 2018 challenge.
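
To make the idea of embedding-level pooling concrete, the following is a minimal sketch of a generic class-wise attention pooling step, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name, the per-class attention and classifier vectors, and the random inputs are all hypothetical, and the specialized decision surface and disentangled feature are not modeled here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def class_wise_attention_pool(frames, attn_weights, clf_weights, clf_bias):
    """Pool T frame-level features into one contextual feature per class.

    frames:       T x D frame-level features from a feature encoder
    attn_weights: C x D, one attention vector per class (hypothetical)
    clf_weights:  C x D, one classifier vector per class (hypothetical)
    clf_bias:     C classifier biases (hypothetical)
    Returns (clip_probs, attention); attention[c][t] is the weight that
    class c assigns to frame t.
    """
    num_frames, dim = len(frames), len(frames[0])
    clip_probs, attention = [], []
    for w_att, w_clf, b in zip(attn_weights, clf_weights, clf_bias):
        # Attention over time for this class; the weights sum to 1.
        a = softmax([dot(w_att, x) for x in frames])
        # Contextual feature: attention-weighted sum of the frame features.
        h = [sum(a[t] * frames[t][d] for t in range(num_frames))
             for d in range(dim)]
        # Clip-level probability computed from the pooled feature.
        clip_probs.append(1.0 / (1.0 + math.exp(-(dot(w_clf, h) + b))))
        attention.append(a)
    return clip_probs, attention


# Toy usage with random numbers standing in for encoder outputs.
random.seed(0)
T, D, C = 5, 4, 3  # frames, feature dimension, classes
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
attn_w = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
clf_w = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
clf_b = [0.0] * C

clip_probs, attention = class_wise_attention_pool(frames, attn_w, clf_w, clf_b)
```

In a weakly supervised setting, only the clip-level probabilities would be compared against the weak labels during training, while the per-class attention weights over time are the kind of finer information that a pooling module may learn without frame-level supervision.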

 
