Abstract
In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition (ASR) errors can be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to map speech input directly to the desired semantic frame with a single model, thereby mitigating ASR error propagation. Recently, pre-training techniques have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming to explore the full potential of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also investigate the efficacy of combining textual and phonetic information during fine-tuning. Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models and improves the robustness of spoken language understanding to ASR errors.