Fine-tuning of Language Models with Discriminator

Abstract

Cross-entropy loss is a common choice for multiclass classification tasks, and for language modeling in particular. Minimizing this loss yields language models of very good quality. We show that these models can be made to perform even better if they are fine-tuned with the sum of the cross-entropy loss and the reverse Kullback-Leibler divergence. The latter is estimated using a discriminator network that we train in advance. During fine-tuning, we use this discriminator to detect whether the probabilities of some words are overestimated and, if so, reduce them. The novel approach that we propose allows us to reach state-of-the-art quality on Penn Treebank: the perplexity of the fine-tuned model drops by more than 0.5 and is now below 54.0 in the standard evaluation setting; in the dynamic evaluation framework, however, the improvement is much less perceptible. Our fine-tuning algorithm is rather fast and requires almost no hyperparameter tuning. We test it on different datasets, including WikiText-2 and a large-scale dataset. On WikiText-2 we also reach state-of-the-art results.
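
The combined objective described in the abstract can be illustrated with a small sketch. The abstract does not specify how the discriminator's output is turned into a divergence estimate, so the code below assumes the standard density-ratio trick: if a discriminator output d(w) approximates p(w) / (p(w) + q(w)), where p is the data distribution and q is the model distribution, then q(w) / p(w) is approximately (1 - d(w)) / d(w). The function names, the `weight` parameter, and the `eps` constant are hypothetical and chosen for illustration only; this is not necessarily the estimator used in the paper.

```python
import math


def cross_entropy(q, target):
    """Negative log-probability of the observed next word under the model q."""
    return -math.log(q[target])


def reverse_kl_estimate(q, d, eps=1e-12):
    """Estimate KL(q || p) = sum_w q(w) * log(q(w) / p(w)).

    Assumption (not stated in the abstract): d[w] approximates
    p(w) / (p(w) + q(w)), so q(w) / p(w) is roughly (1 - d[w]) / d[w].
    Words with d[w] < 0.5 are the ones the model overestimates; they
    contribute positive terms, so minimizing the sum pushes q(w) down.
    """
    return sum(
        qw * math.log((1.0 - dw + eps) / (dw + eps))
        for qw, dw in zip(q, d)
    )


def fine_tune_loss(q, target, d, weight=1.0):
    """Sum of cross-entropy and the discriminator-based reverse KL estimate.

    `weight` is a hypothetical knob; the abstract describes a plain sum.
    """
    return cross_entropy(q, target) + weight * reverse_kl_estimate(q, d)


# Toy three-word vocabulary: the model overestimates word 0 and
# underestimates word 2 relative to the data distribution.
p = [0.5, 0.3, 0.2]   # data distribution (unknown in practice)
q = [0.6, 0.3, 0.1]   # model distribution
d = [pw / (pw + qw) for pw, qw in zip(p, q)]  # idealized discriminator

exact_kl = sum(qw * math.log(qw / pw) for pw, qw in zip(p, q))
estimated_kl = reverse_kl_estimate(q, d)
loss = fine_tune_loss(q, target=0, d=d)
```

With an idealized discriminator, `estimated_kl` matches `exact_kl` up to the small `eps` correction. In practice the discriminator is a trained network, so the estimate is only as good as the discriminator.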
