Born Again Neural Networks

Abstract

Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating a role of the teacher outputs on both predicted and non-predicted classes. We present experiments with students of various capacities, focusing on the under-explored case where students overpower teachers. Our experiments show significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
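
As a rough illustration of the teacher-to-student transfer described above, the following minimal sketch computes a generic distillation loss: the cross-entropy between the teacher's and the student's output distributions. It is not taken from the paper. The function names (`log_softmax`, `distillation_loss`) and the example logits are hypothetical, and the sketch covers neither the CWTM nor the DKPP variant.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis (axis 1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def distillation_loss(student_logits, teacher_logits):
    """Mean cross-entropy between teacher and student output distributions.

    Assumption: both arrays have shape (batch, num_classes). This is a
    generic KD objective, not necessarily the exact one used in the paper.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits))
    student_log_probs = log_softmax(student_logits)
    return float(-(teacher_probs * student_log_probs).sum(axis=1).mean())

# Hypothetical logits for a batch of two examples and three classes.
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
student_logits = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
loss = distillation_loss(student_logits, teacher_logits)
```

In the born-again setting the student has the same architecture as the teacher, so a loss of this kind would be minimized over a freshly initialized copy of the teacher's network rather than over a smaller model.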
