Self-Knowledge Distillation in Natural Language Processing

Abstract

Since deep learning became a key player in natural language processing (NLP), many deep learning models have shown remarkable performance on a variety of NLP tasks, in some cases even outperforming humans. Such high performance can be explained by the efficient knowledge representation of deep learning models. While many methods have been proposed to learn more efficient representations, knowledge distillation from pretrained deep networks suggests that we can use more information from the soft target probabilities to train other neural networks. In this paper, we propose a new knowledge distillation method, self-knowledge distillation, based on the soft target probabilities of the training model itself, where multimode information is distilled from the word embedding space right below the softmax layer. Because of the time complexity, our method approximates the soft target probabilities. In experiments, we applied the proposed method to two different and fundamental NLP tasks: language modeling and neural machine translation. The experimental results show that the proposed method improves performance on both tasks.
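To make the idea of soft targets drawn from the model's own embedding space concrete, the following is a minimal sketch and not the paper's formulation. It assumes, purely for illustration, that the soft target distribution is a softmax over dot-product similarities between each word embedding and the target word's embedding, and that the training loss interpolates the usual hard-label cross-entropy with a cross-entropy against those soft targets. The function names and the `alpha` and `temperature` parameters are hypothetical. The sketch computes the full distribution over a toy vocabulary and does not attempt the approximation the abstract mentions.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def soft_targets(embeddings, target_index, temperature=1.0):
    """Assumed soft targets: similarity of every word embedding to the
    target word's embedding, normalized with a softmax (illustrative only)."""
    target_vec = embeddings[target_index]
    similarities = [dot(vec, target_vec) for vec in embeddings]
    return softmax(similarities, temperature)


def cross_entropy(target_dist, predicted_dist):
    """Cross-entropy H(target, predicted), skipping zero-mass targets."""
    return -sum(t * math.log(p)
                for t, p in zip(target_dist, predicted_dist) if t > 0)


def self_distillation_loss(logits, embeddings, target_index,
                           alpha=0.5, temperature=1.0):
    """Interpolate hard-label and soft-target cross-entropy.
    alpha is a hypothetical mixing weight, not a value from the paper."""
    predicted = softmax(logits)
    hard = [1.0 if i == target_index else 0.0 for i in range(len(logits))]
    soft = soft_targets(embeddings, target_index, temperature)
    return ((1 - alpha) * cross_entropy(hard, predicted)
            + alpha * cross_entropy(soft, predicted))


# Toy vocabulary of three words; words 0 and 1 have similar embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_logits = [2.0, 1.0, 0.1]
toy_soft = soft_targets(toy_embeddings, target_index=0)
toy_loss = self_distillation_loss(toy_logits, toy_embeddings, target_index=0)
```

In this toy setup, the word whose embedding lies close to the target's receives more soft-target mass than the dissimilar word, which is the kind of extra information a one-hot label cannot carry.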
