GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

Abstract

Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. As a case study, a state-of-the-art neural language model usually consists of one or more recurrent layers sandwiched between an embedding layer used for representing input tokens and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words, and its word embedding and softmax matrices take up more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on block-wise low-rank matrix approximation over a partition of the vocabulary and on the inherent frequency distribution of tokens (the power-law distribution of words). The experimental results show that our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieves a compression rate of 6.6 times for the embedding and softmax matrices; when combined with quantization, it achieves a compression rate of 26 times, which translates to a factor of 12.8 times compression for the entire model with very little degradation in perplexity.
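The arithmetic behind block-wise low-rank compression can be illustrated with a minimal parameter-counting sketch. It is not the paper's implementation: the block sizes, ranks, and embedding dimension below are hypothetical toy values, the function names are made up for illustration, and the choice to give frequent-word blocks a higher rank is an assumption suggested by the abstract's reference to the power-law distribution of words. The last helper assumes the rest of the model is left uncompressed.

```python
def low_rank_params(rows, dim, rank):
    """Parameters needed to store a (rows x dim) block as U (rows x rank) @ V (rank x dim)."""
    return rank * (rows + dim)


def blockwise_compression(block_sizes, ranks, dim):
    """Compression ratio when each vocabulary block gets its own low-rank factors."""
    full = sum(block_sizes) * dim
    compressed = sum(low_rank_params(n, dim, r) for n, r in zip(block_sizes, ranks))
    return full / compressed


def whole_model_compression(fraction, ratio):
    """Overall ratio when only `fraction` of all parameters is compressed by `ratio`.

    Assumption: the remaining (1 - fraction) of the parameters is stored unchanged.
    """
    return 1.0 / ((1.0 - fraction) + fraction / ratio)


# Hypothetical toy setup: words sorted by frequency and split into three blocks.
# Frequent words (small first block) get a higher rank than rare words.
block_sizes = [1000, 9000, 90000]   # words per block (assumed values)
ranks = [256, 64, 16]               # rank per block (assumed values)
dim = 1024                          # embedding dimension (assumed value)

ratio = blockwise_compression(block_sizes, ranks, dim)
print(f"embedding compression: {ratio:.1f}x")
print(f"whole model at 96% share, 26x: {whole_model_compression(0.96, 26):.1f}x")
```

Under this simple model, compressing 26 times a part that holds about 96% of the parameters gives roughly 13 times overall, which is in the same range as the 12.8 times reported above; the exact share in bigLSTM is not stated in the abstract beyond "over 90%".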
