BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

Abstract

In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement throughout training. In this way, our approach brings a deeper level of interaction between the original and compact models, and smooths the training process. Compared to previous knowledge distillation approaches for BERT compression, our approach leverages only one loss function and one hyper-parameter, liberating human effort from hyper-parameter tuning. Our approach outperforms existing knowledge distillation approaches on the GLUE benchmark, offering a new perspective on model compression.
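
The replacement step described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the names `replacement_prob` and `mixed_forward`, the linear shape of the schedule, and the starting value `start=0.5` are all assumptions made for illustration, and modules are modeled as plain Python callables rather than Transformer layers.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A "module" here is any callable mapping an input to an output.
# In a real model this would be a group of Transformer layers (assumption).
Module = Callable[[float], float]


def replacement_prob(step: int, total_steps: int, start: float = 0.5) -> float:
    """Raise the replacement probability linearly from `start` to 1.0.

    The linear shape and the default `start` are illustrative assumptions;
    the abstract only states that the probability increases during training.
    """
    if total_steps <= 0:
        return 1.0
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (1.0 - start) * frac


def mixed_forward(
    x: float,
    originals: Sequence[Module],
    substitutes: Sequence[Module],
    p: float,
    rng: random.Random,
) -> Tuple[float, List[bool]]:
    """Run one forward pass, swapping each original module for its
    compact substitute independently with probability `p`."""
    if len(originals) != len(substitutes):
        raise ValueError("each original module needs exactly one substitute")
    chosen: List[bool] = []
    for orig, sub in zip(originals, substitutes):
        use_sub = rng.random() < p  # Bernoulli(p) draw per module position
        x = (sub if use_sub else orig)(x)
        chosen.append(use_sub)
    return x, chosen


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for original modules and their compact substitutes.
    originals = [lambda v: v + 1.0, lambda v: v * 2.0]
    substitutes = [lambda v: v + 10.0, lambda v: v * 3.0]
    for step in (0, 50, 100):
        p = replacement_prob(step, total_steps=100)
        out, chosen = mixed_forward(1.0, originals, substitutes, p, rng)
        print(f"step={step} p={p:.2f} chosen={chosen} out={out}")
```

In this sketch, once `p` reaches 1.0 every position uses its substitute, so the forward pass runs entirely through the compact modules. In an actual training setup, only the substitutes would receive gradient updates while the original modules stay frozen; that detail is omitted here.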
