Abstract
State-of-the-art neural language models (LMs) represented by Transformers are highly complex. Their use of fixed, deterministic parameter estimates fails to account for model uncertainty and leads to over-fitting and poor generalization when training data is limited. To address these issues, this paper proposes a full Bayesian learning framework for Transformer LM estimation. Efficient variational inference based approaches are used to estimate the latent parameter posterior distributions associated with different parts of the Transformer model architecture, including the multi-head self-attention, feed-forward and embedding layers. Statistically significant word error rate (WER) reductions of up to 0.5\% absolute (3.18\% relative) and consistent perplexity gains were obtained over the baseline Transformer LMs on state-of-the-art Switchboard corpus trained LF-MMI factored TDNN systems with i-Vector speaker adaptation. Performance improvements were also obtained on a cross-domain LM adaptation task that required porting a Transformer LM trained on the Switchboard and Fisher data to a low-resource DementiaBank elderly speech corpus.