Autoregressive Knowledge Distillation through Imitation Learning

Abstract

The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of hindering inference speed, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.
