Building competitive direct acoustics-to-word models for English conversational speech recognition

Abstract

Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is at-par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder or language model. We find that model initialization, training data order, and regularization have the most impact on the A2W model performance. Next, we present a joint word-character A2W model that learns to first spell the word and then recognize it. This model provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely-seen during training.
