Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning

Abstract

In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition (ASR) errors can be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to map speech input directly to the desired semantic frame with a single model, thereby mitigating ASR error propagation. Recently, pre-training technologies have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming to exploit the full potential of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also investigate the efficacy of combining textual and phonetic information during fine-tuning. Experimental results on the spoken language understanding benchmarks Fluent Speech Commands and SNIPS show that the proposed approach significantly outperforms strong baseline models and improves the robustness of spoken language understanding to ASR errors.
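
To make the idea of a conditional masked language model objective over paired text and phoneme sequences more concrete, here is a minimal sketch of how one training example might be assembled. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `[CLS]`/`[SEP]`/`[MASK]` tokens, the choice to mask only the phoneme side while leaving the words visible, the 15% default masking rate, and the hand-written phoneme sequence are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, borrowed from BERT-style models


def mask_tokens(tokens, mask_prob, rng):
    """Independently replace each token with MASK with probability mask_prob.

    Returns (masked_tokens, labels); labels[i] holds the original token at
    masked positions and None everywhere else.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def build_joint_example(words, phonemes, mask_prob=0.15, seed=0):
    """Concatenate words and masked phonemes into one input sequence.

    Assumption: the phonemes are masked and the words are left intact, so a
    model would predict phonemes conditioned on the text. The reverse
    direction (masking words, conditioning on phonemes) is symmetric.
    """
    rng = random.Random(seed)
    masked_phonemes, labels = mask_tokens(phonemes, mask_prob, rng)
    inputs = ["[CLS]"] + words + ["[SEP]"] + masked_phonemes + ["[SEP]"]
    # Targets are aligned with inputs; only masked phoneme slots carry a label.
    targets = [None] * (len(words) + 2) + labels + [None]
    return inputs, targets


# Illustrative, hand-written utterance and phoneme labels (not from the paper).
words = ["turn", "on", "the", "lights"]
phonemes = ["T", "ER", "N", "AA", "N", "DH", "AH", "L", "AY", "T", "S"]
inputs, targets = build_joint_example(words, phonemes, mask_prob=0.3, seed=1)
```

A model trained on such examples would be asked to recover the original phoneme at every position where `targets` is not `None`, using both the surrounding phonemes and the unmasked words as context.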
