Cross-utterance Reranking Models with BERT and Graph Convolutional Networks for Conversational Speech Recognition

Abstract

How to effectively incorporate cross-utterance information cues into a neural language model (LM) has emerged as one of the intriguing issues for automatic speech recognition (ASR). Existing research efforts on improving contextualization of an LM typically regard previous utterances as a sequence of additional input and may fail to capture complex global structural dependencies among these utterances. In view of this, we seek in this paper to represent the historical context information of an utterance as graph-structured data so as to distill cross-utterance, global word interaction relationships. To this end, we apply a graph convolutional network (GCN) on the resulting graph to obtain the corresponding GCN embeddings of historical words. GCN has recently found versatile applications in social-network analysis and text summarization, among others, due mainly to its ability to effectively capture rich relational information among elements. However, GCN remains largely underexplored in the context of ASR, especially for dealing with conversational speech. In addition, we frame ASR N-best reranking as a prediction problem, leveraging bidirectional encoder representations from transformers (BERT) as the vehicle to not only seize the local intrinsic word regularity patterns inherent in a candidate hypothesis but also incorporate the cross-utterance, historical word interaction cues distilled by GCN for promoting performance. Extensive experiments conducted on the AMI benchmark dataset seem to confirm the pragmatic utility of our methods, in relation to some current top-of-the-line methods.
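
To make the idea of GCN embeddings over a graph of historical words more concrete, the following is a minimal sketch of a single graph-convolution step. The abstract does not state which GCN formulation is used, so this assumes the widely used symmetric-normalized propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W). The function name `gcn_layer`, the toy three-word graph, and the weight values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj    : (n, n) symmetric adjacency matrix over history words (assumed)
    feats  : (n, d_in) input node features H
    weight : (d_in, d_out) layer weights W
    """
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # degrees, >= 1 thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: scale row i and column j by deg^-1/2
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)


# Toy graph (illustrative only): three history words, edges 0-1 and 1-2.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
feats = np.eye(3)            # one-hot node features
weight = np.ones((3, 2))     # placeholder weights, not learned
emb = gcn_layer(adj, feats, weight)
```

Each row of `emb` is the embedding of one history word after aggregating information from its graph neighbors. In a reranking setting, such embeddings would be combined with the representation of each candidate hypothesis, but how that fusion is done is not described in the abstract and is left out of this sketch.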
