Abstract
Language models (LMs) pre-trained on massive amounts of text, in particular bidirectional encoder representations from Transformers (BERT), generative pre-training (GPT), and GPT-2, have become a key technology for many natural language processing tasks. In this paper, we present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR). Unlike the unidirectional LMs GPT and GPT-2, BERT is bidirectional, so the direct product of its output probabilities is no longer a valid language prior probability. A conversion method is proposed to compute the correct language prior probability from bidirectional LM outputs in a mathematically exact way. Experimental results on the widely used AMI and Switchboard ASR tasks show that the combination of the fine-tuned GPT and GPT-2 outperforms the combination of three neural LMs with different architectures trained from scratch on the in-domain text, by up to 12% relative word error rate reduction (WERR). Furthermore, on the AMI corpus, the proposed conversion for language prior probabilities enables BERT to obtain an extra 3% relative WERR, and the combination of BERT, GPT and GPT-2 yields further improvements.
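
The claim that a direct product of bidirectional outputs is not a valid prior can be checked on a toy case. The sketch below is illustrative only and is not the conversion method proposed in the paper: it assumes a hypothetical two-word vocabulary and a made-up joint distribution `JOINT` over two-word sentences, and compares the product of idealised bidirectional conditionals P(w_t | all other words) with the left-to-right chain-rule factorisation used by unidirectional LMs.

```python
from itertools import product

# Hypothetical toy setup (not from the paper): a two-word vocabulary and
# an arbitrary joint distribution over all two-word sentences.
VOCAB = ("a", "b")
JOINT = {
    ("a", "a"): 0.4,
    ("a", "b"): 0.1,
    ("b", "a"): 0.1,
    ("b", "b"): 0.4,
}


def bidirectional_conditional(sentence, position):
    """P(w_position | every other word), as an idealised bidirectional LM outputs."""
    denominator = sum(
        JOINT[sentence[:position] + (word,) + sentence[position + 1:]]
        for word in VOCAB
    )
    return JOINT[sentence] / denominator


def direct_product_score(sentence):
    """Product of the bidirectional conditionals over all positions."""
    score = 1.0
    for position in range(len(sentence)):
        score *= bidirectional_conditional(sentence, position)
    return score


def chain_rule_score(sentence):
    """Left-to-right factorisation: product of P(w_t | w_1 .. w_{t-1})."""
    score = 1.0
    for t in range(len(sentence)):
        numerator = sum(p for s, p in JOINT.items() if s[: t + 1] == sentence[: t + 1])
        denominator = sum(p for s, p in JOINT.items() if s[:t] == sentence[:t])
        score *= numerator / denominator
    return score


sentences = list(product(VOCAB, repeat=2))
direct_product_total = sum(direct_product_score(s) for s in sentences)
chain_rule_total = sum(chain_rule_score(s) for s in sentences)

print(f"direct product total: {direct_product_total:.2f}")  # 1.36
print(f"chain rule total:     {chain_rule_total:.2f}")      # 1.00
```

In this toy case the direct-product scores sum to 1.36 over all sentences, so they do not form a probability distribution, whereas the chain-rule scores sum to 1 and recover `JOINT` exactly.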