On the limit of English conversational speech recognition

Abstract

In our previous work we demonstrated that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition. In this paper, we further improve the results for both Switchboard 300 and 2000. Through use of an improved optimizer, speaker vector embeddings, and alternative speech representations we reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative. Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models. Our study also considers the recently proposed conformer, and more advanced self-attention based language models. Overall, the conformer shows similar performance to the LSTM; nevertheless, their combination and decoding with an improved LM reaches a new record on Switchboard-300, 5.0% and 10.0% WER on SWB and CHM. Our findings are also confirmed on Switchboard-2000, and a new state of the art is reported, practically reaching the limit of the benchmark.
