Speaker Recognition from raw waveform with SincNet

  • 2018-07-29 16:27:19
  • Mirco Ravanelli, Yoshua Bengio


Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, that learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
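The idea of a band-pass filter defined only by its two cutoff frequencies can be illustrated with a minimal sketch. It relies on the standard signal-processing fact that an ideal band-pass filter is the difference of two ideal low-pass (sinc) filters. The function names, the Hamming window, the filter length, and the cutoff values below are assumptions made for illustration; they are not taken from the abstract and this is not the authors' implementation. In a trainable layer the two cutoffs would be learnable parameters, whereas here they are fixed numbers.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def bandpass_kernel(f_low, f_high, length=101):
    """Windowed band-pass filter defined only by its two cutoffs.

    f_low, f_high: cutoffs in cycles per sample (0 < f_low < f_high < 0.5).
    length: odd number of taps (illustrative default, an assumption).
    """
    if not (0.0 < f_low < f_high < 0.5):
        raise ValueError("need 0 < f_low < f_high < 0.5")
    if length % 2 == 0:
        raise ValueError("length must be odd so the kernel is symmetric")
    half = length // 2
    kernel = []
    for i in range(length):
        n = i - half  # time index centered on zero
        # Difference of two ideal low-pass filters gives an ideal band-pass.
        ideal = (2.0 * f_high * sinc(2.0 * math.pi * f_high * n)
                 - 2.0 * f_low * sinc(2.0 * math.pi * f_low * n))
        # Hamming window (an assumed choice) smooths the truncation.
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
        kernel.append(ideal * window)
    return kernel


def gain(kernel, freq):
    """Magnitude of the filter's frequency response at freq (cycles/sample)."""
    re = sum(h * math.cos(2.0 * math.pi * freq * k) for k, h in enumerate(kernel))
    im = sum(h * math.sin(2.0 * math.pi * freq * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)


# Hypothetical cutoffs: pass roughly 0.05 to 0.15 cycles per sample.
kernel = bandpass_kernel(0.05, 0.15, length=101)
```

In this sketch all 101 taps are determined by just two numbers, `f_low` and `f_high`, whereas a standard convolutional filter of the same length would have 101 free parameters. Calling `gain(kernel, 0.10)` probes the response inside the pass band, and `gain(kernel, 0.30)` probes it outside.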


