X-DC: Explainable Deep Clustering based on Learnable Spectrogram Templates

Abstract

Deep neural networks (DNNs) have achieved substantial predictive performance in various speech processing tasks. In particular, it has been shown that a monaural speech separation task can be successfully solved with a DNN-based method called deep clustering (DC), which uses a DNN to describe the process of assigning a continuous vector to each time-frequency (TF) bin and to measure how likely each pair of TF bins is to be dominated by the same speaker. In DC, the DNN is trained so that the embedding vectors for the TF bins dominated by the same speaker are forced to get close to each other. One concern regarding DC is that the embedding process described by a DNN has a black-box structure, which is usually very hard to interpret. A potential weakness of this non-interpretable black-box structure is that it lacks the flexibility to address mismatches between training and test conditions (caused by reverberation, for instance). To overcome this limitation, we propose the concept of explainable deep clustering (X-DC), whose network architecture can be interpreted as a process of fitting learnable spectrogram templates to an input spectrogram followed by Wiener filtering. During training, the elements of the spectrogram templates and their activations are constrained to be non-negative, which promotes sparsity in their values and thus improves interpretability. The main advantage of this framework is that its physically interpretable structure naturally allows us to incorporate a model adaptation mechanism into the network. We experimentally show that the proposed X-DC enables us to visualize and understand the clues the model uses to determine the embedding vectors while achieving speech separation performance comparable to that of the original DC models.
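
The two ideas summarized in the abstract can be illustrated with a minimal NumPy sketch; it is not the paper's implementation. The first function evaluates the affinity-based objective commonly used for DC, ||VV^T - YY^T||_F^2, where V holds one embedding per TF bin and Y holds one-hot speaker labels. The second function is an assumption on our part: it builds NMF-style Wiener masks from non-negative templates and activations, which may differ from how X-DC actually parameterizes them. All function names, argument names, and array shapes are hypothetical.

```python
import numpy as np


def dc_affinity_loss(V, Y):
    """Squared Frobenius norm ||V V^T - Y Y^T||_F^2.

    V: (N, D) embeddings, one row per TF bin.
    Y: (N, C) one-hot labels of the dominant speaker per TF bin.
    Uses the expansion ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2,
    so the N x N affinity matrices are never formed.
    """
    VtV = V.T @ V
    VtY = V.T @ Y
    YtY = Y.T @ Y
    return float(np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2))


def wiener_style_masks(W, H, eps=1e-8):
    """Wiener-style masks from non-negative templates (illustrative assumption).

    W: (K, F, D) non-negative spectral templates, D templates per source.
    H: (K, D, T) non-negative activations over time.
    Returns masks of shape (K, F, T) that sum to roughly 1 over the K sources.
    """
    # Per-source model spectrogram: templates weighted by their activations.
    power = np.einsum("kfd,kdt->kft", W, H)
    # Each source's share of the total modelled energy in every TF bin.
    return power / (power.sum(axis=0, keepdims=True) + eps)


# Tiny usage example with random, hypothetical data.
rng = np.random.default_rng(0)
labels = np.eye(2)[rng.integers(0, 2, size=12)]   # (12 TF bins, 2 speakers)
embeddings = rng.standard_normal((12, 4))         # (12 TF bins, 4 dims)
loss = dc_affinity_loss(embeddings, labels)

templates = rng.random((2, 6, 3)) + 0.1           # 2 sources, 6 freq bins, 3 templates
activations = rng.random((2, 3, 5)) + 0.1         # 3 templates, 5 frames
masks = wiener_style_masks(templates, activations)
```

In this sketch the loss is zero when the embeddings reproduce the label affinities exactly, and each mask would be multiplied elementwise with the mixture spectrogram to estimate one source.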
