Abstract
A common technique for compressing a neural network is to compute the$k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix$A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (orembedding layer). Here, $d$ is the number of the neurons in the layer, $n$ isthe number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memoryinstead of $O(nd)$. This $\ell_2$-approximation minimizes the sum over every entry to the powerof $p=2$ in the matrix $A - A_{k,2}$, among every matrix$A_{k,2}\in\mathbb{R}^{n\times d}$ whose rank is $k$. While it can be computedefficiently via SVD, the $\ell_2$-approximation is known to be very sensitiveto outliers ("far-away" rows). Hence, machine learning uses e.g. LassoRegression, $\ell_1$-regularization, and $\ell_1$-SVM that use the$\ell_1$-norm. This paper suggests to replace the $k$-rank $\ell_2$ approximation by$\ell_p$, for $p\in [1,2]$. We then provide practical and provableapproximation algorithms to compute it for any $p\geq1$, based on moderntechniques in computational geometry. Extensive experimental results on the GLUE benchmark for compressing BERT,DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example,our approach achieves $28\%$ compression of RoBERTa's embedding layer with only$0.63\%$ additive drop in the accuracy (without fine-tuning) in average overall tasks in GLUE, compared to $11\%$ drop using the existing$\ell_2$-approximation. Open code is provided for reproducing and extending ourresults.