Abstract
In this work, we study how the large-scale pretrain-finetune framework changes the behavior of a neural language generator. We focus on the transformer encoder-decoder model for the open-domain dialogue response generation task. We find that after standard fine-tuning, the model forgets important language generation skills acquired during large-scale pre-training. We demonstrate the forgetting phenomenon through a detailed behavior analysis from the perspectives of context sensitivity and knowledge transfer. Adopting the concept of data mixing, we propose an intuitive fine-tuning strategy named "mix-review". We find that mix-review effectively regularizes the fine-tuning process, and the forgetting problem is largely alleviated. Finally, we discuss interesting behavior of the resulting dialogue model and its implications.