Abstract
We rerank with scores from pretrained masked language models like BERT to improve ASR and NMT performance. These log-pseudolikelihood scores (LPLs) can outperform large, autoregressive language models (GPT-2) in out-of-the-box scoring. RoBERTa reduces WER by up to 30% relative on an end-to-end LibriSpeech system and adds up to +1.7 BLEU on state-of-the-art baselines for TED Talks low-resource pairs, with further gains from domain adaptation. In the multilingual setting, a single XLM can be used to rerank translation outputs in multiple languages. The numerical and qualitative properties of LPL scores suggest that LPLs capture sentence fluency better than autoregressive scores. Finally, we finetune BERT to estimate sentence LPLs without masking, enabling scoring in a single, non-recurrent inference pass.
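
The abstract does not spell out how an LPL is computed, so the following is only a minimal sketch. It assumes the usual pseudolikelihood definition: mask each token in turn and sum the log-probability the model assigns to the original token given the rest of the sentence. The `masked_log_prob` callable, the toy unigram table standing in for a real masked LM, and the interpolation `weight` are all hypothetical names introduced for illustration and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical interface (an assumption, not the paper's API): given a token
# sequence with one position masked, return
# log P(original token at that position | all other tokens).
MaskedLogProb = Callable[[Sequence[str], int, str], float]


def log_pseudolikelihood(tokens: Sequence[str],
                         masked_log_prob: MaskedLogProb) -> float:
    """Sum conditional log-probabilities, masking one token at a time."""
    total = 0.0
    for i, target in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK  # hide only position i; the rest stays visible
        total += masked_log_prob(masked, i, target)
    return total


def rerank(hypotheses: List[Tuple[str, float]],
           masked_log_prob: MaskedLogProb,
           weight: float = 1.0) -> List[Tuple[str, float]]:
    """Re-sort (sentence, base_score) pairs by base_score + weight * LPL.

    The linear interpolation with `weight` is an assumed combination rule.
    """
    rescored = []
    for sentence, base_score in hypotheses:
        lpl = log_pseudolikelihood(sentence.split(), masked_log_prob)
        rescored.append((sentence, base_score + weight * lpl))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a masked LM: a fixed unigram table that ignores context.
# A real model would condition on the unmasked tokens.
TOY_PROBS = {"the": 0.5, "cat": 0.3, "sat": 0.15}


def toy_masked_log_prob(masked_tokens: Sequence[str],
                        position: int,
                        target: str) -> float:
    return math.log(TOY_PROBS.get(target, 0.05))


if __name__ == "__main__":
    n_best = [("the kat sat", -0.9), ("the cat sat", -1.0)]
    for sentence, score in rerank(n_best, toy_masked_log_prob):
        print(f"{score:.3f}  {sentence}")
```

Scoring a sentence this way takes one forward pass per token, which is the cost that the single-pass, maskless estimate mentioned at the end of the abstract is meant to avoid.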