Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting

Abstract

Speech recognition is a sequence prediction problem. Beyond the various deep learning approaches to frame-level classification, sequence-level discriminative training has proven indispensable for achieving state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR). However, keyword spotting (KWS), one of the most common speech recognition tasks, has benefited almost exclusively from frame-level deep learning because competing sequence hypotheses are difficult to obtain. The few studies on sequence discriminative training for KWS are limited to fixed vocabulary or LVCSR based methods and have not been compared with state-of-the-art deep learning based KWS approaches. In this paper, a sequence discriminative training framework is proposed for both fixed vocabulary and unrestricted acoustic KWS. Sequence discriminative training is systematically investigated for both sequence-level generative and discriminative models. By introducing word-independent phone lattices or non-keyword blank symbols to construct competing hypotheses, feasible and efficient sequence discriminative training approaches are proposed for acoustic KWS. Experiments show that the proposed approaches obtain consistent and significant improvements in both fixed vocabulary and unrestricted KWS tasks compared to previous frame-level deep learning based acoustic KWS methods.
