A Purely End-to-end System for Multi-speaker Speech Recognition

Abstract

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
