Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Abstract

In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.
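
To make the idea of stochastically mixing the parameters of two models concrete, the following is a minimal sketch, not the authors' implementation. It assumes, by analogy with inverted dropout, that each current parameter is swapped for its pretrained counterpart with probability p and that the result is then shifted and rescaled so its expectation equals the current parameter. The abstract does not state this correction; the function name `mixout`, its arguments, and the flat-list representation of parameters are all hypothetical.

```python
import random


def mixout(weights, pretrained, p, rng):
    """Return a stochastically mixed copy of `weights` (illustrative sketch).

    Each entry is replaced by the matching entry of `pretrained` with
    probability p. The mixed value is then shifted and rescaled so that its
    expectation equals the original weight. This correction is an assumption
    modeled on inverted dropout, not something stated in the abstract.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if len(weights) != len(pretrained):
        raise ValueError("weights and pretrained must have the same length")

    mixed = []
    for w, u in zip(weights, pretrained):
        # Keep the current weight, or fall back to the pretrained one.
        picked = u if rng.random() < p else w
        # Assumed correction: E[(picked - p*u) / (1 - p)] = w.
        mixed.append((picked - p * u) / (1.0 - p))
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    current = [0.8, -1.2, 0.3]      # parameters being finetuned (made-up values)
    pretrained = [1.0, -1.0, 0.0]   # parameters of the pretrained model (made-up values)
    print(mixout(current, pretrained, p=0.5, rng=rng))
```

Under these assumptions, setting every pretrained value to zero reduces the sketch to ordinary inverted dropout (each weight is either zeroed or scaled by 1/(1-p)), which is one way to read the abstract's remark that mixout is motivated by dropout.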
