Abstract
Cross-entropy loss is a common choice for multiclass classification tasks, and for language modeling in particular. Minimizing this loss results in language models of very good quality. We show that these models can be made to perform even better if they are fine-tuned with the sum of the cross-entropy loss and the reverse Kullback-Leibler divergence. The latter is estimated using a discriminator network that we train in advance. During fine-tuning, the probabilities of rare words, which language models usually underestimate, increase. The novel approach that we propose allows us to reach state-of-the-art quality on Penn Treebank: perplexity decreases from 52.4 to 52.1. Our fine-tuning algorithm is rather fast, scales well to different architectures and datasets, and requires almost no hyperparameter tuning: the only hyperparameter that needs to be tuned is the learning rate.
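
As an illustration of the objective described above, the following is a minimal sketch of a loss that adds a discriminator-based estimate of the reverse Kullback-Leibler divergence to the cross-entropy term. It is not taken from the paper's implementation: the function names, the unweighted sum, and the toy distributions are assumptions made for this example. In particular, it assumes the pre-trained discriminator outputs D(x), the probability that token x comes from the data distribution p rather than the model distribution q, so that for an optimal discriminator D(x) = p(x) / (p(x) + q(x)) and hence log(q(x) / p(x)) = log((1 - D(x)) / D(x)).

```python
import math


def cross_entropy(target_index, model_probs):
    """Negative log-probability the model assigns to the observed token."""
    return -math.log(model_probs[target_index])


def reverse_kl_from_discriminator(model_probs, disc_real_probs):
    """Estimate KL(q || p) = sum_x q(x) * log(q(x) / p(x)).

    Assumption: disc_real_probs[x] approximates D(x) = p(x) / (p(x) + q(x)),
    so the log-ratio log(q(x) / p(x)) is replaced by log((1 - D(x)) / D(x)).
    """
    return sum(
        q * math.log((1.0 - d) / d)
        for q, d in zip(model_probs, disc_real_probs)
        if q > 0.0
    )


def combined_loss(target_index, model_probs, disc_real_probs):
    """Cross-entropy plus the estimated reverse KL (hypothetical unweighted sum)."""
    return cross_entropy(target_index, model_probs) + reverse_kl_from_discriminator(
        model_probs, disc_real_probs
    )


# Toy three-word vocabulary; the last word is rare and underestimated by the model.
data_probs = [0.5, 0.3, 0.2]   # p: data distribution (unknown in practice)
model_probs = [0.6, 0.3, 0.1]  # q: language model distribution
# Stand-in for a trained discriminator: here it is the optimal one by construction.
disc = [p / (p + q) for p, q in zip(data_probs, model_probs)]

loss = combined_loss(target_index=2, model_probs=model_probs, disc_real_probs=disc)
```

In this toy setting the discriminator is optimal, so the estimate coincides with the exact reverse KL divergence; with a learned discriminator it would only be an approximation. The reverse KL term is zero when the model matches the data distribution and grows as the model shifts probability mass away from it.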