Abstract
Factorized layers, operations parameterized by products of two or more matrices, occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to those of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight decay and batch normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with, or pruning, a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.