Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks

Abstract

Deep learning has contributed greatly to many successes in artificial intelligence in recent years. Today, it is possible to train models that have thousands of layers and hundreds of billions of parameters. Large-scale deep models have achieved great success, but their enormous computational complexity and storage requirements make them extremely difficult to deploy in real-time applications. At the same time, dataset size remains a real problem in many domains: data are often missing, too expensive, or impossible to obtain for other reasons. Ensemble learning partially solves the problem of small datasets and overfitting. However, ensemble learning in its basic form comes with a linear increase in computational complexity. We analyzed the impact of the ensemble decision-fusion mechanism and examined various methods of combining the decisions, including voting algorithms. We used a modified knowledge distillation framework as a decision-fusion mechanism, which additionally compresses the entire ensemble into the weight space of a single model. We showed that knowledge distillation can aggregate knowledge from multiple teachers into a single student model and, at the same computational complexity, yield a better-performing model than one trained in the standard manner. We developed our own method for mimicking the responses of all teachers simultaneously. We tested these solutions on several benchmark datasets. Finally, we presented broad applications of the efficient multi-teacher knowledge distillation framework. In the first example, we used knowledge distillation to develop models that could automate corrosion detection on aircraft fuselage. The second example describes smoke detection using observation cameras in order to counteract forest wildfires.
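
To make the two fusion ideas mentioned in the abstract concrete, the following is a minimal sketch, not the authors' method. It contrasts hard majority voting with a generic multi-teacher distillation target, under assumptions the abstract does not state: teacher outputs are softened with a temperature and averaged with equal weights, and the student is scored with the KL divergence to that average. All function names and the temperature value are hypothetical.

```python
import math
from collections import Counter


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def majority_vote(teacher_logits):
    """Hard voting: each teacher votes for its top class."""
    votes = [max(range(len(l)), key=l.__getitem__) for l in teacher_logits]
    return Counter(votes).most_common(1)[0][0]


def averaged_soft_targets(teacher_logits, temperature=2.0):
    """Assumed fusion rule: equal-weight mean of softened teacher outputs."""
    probs = [softmax(l, temperature) for l in teacher_logits]
    n_teachers = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_teachers for k in range(n_classes)]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(averaged teachers || student), both softened by the temperature."""
    target = averaged_soft_targets(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)


# Three hypothetical teachers scoring one sample over three classes.
teachers = [[2.0, 1.0, 0.1], [1.5, 1.8, 0.2], [2.2, 0.3, 0.5]]
print(majority_vote(teachers))
print(averaged_soft_targets(teachers))
print(distillation_loss([2.0, 1.0, 0.2], teachers))
```

Unlike voting, the averaged soft target keeps each teacher's relative confidence across all classes, and that extra signal is what a single student can be trained to reproduce at the inference cost of one model.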
