FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Abstract

The large number of parameters in Pretrained Language Models enhances their performance, but also makes them resource-intensive and challenging to deploy on commodity hardware such as a single GPU. Because of the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency, so optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant portion of the efficiency challenge is the feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of the total parameters and inference latency. In this paper, we first observe that only a few neurons in the FFN module, which we call heavy hitters, have a large output norm for any input token, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resources to the FFN part that contains the heavy hitters. In practice, our method can reduce model size by 43.1\% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
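The split described above can be illustrated with a small sketch. It is not the paper's implementation: it assumes a bias-free FFN with ReLU activation, scores each neuron by the mean norm of its contribution to the output, $|h_i(x)| \cdot \lVert W_2[i] \rVert$, over a set of random toy tokens, and takes the top-$k$ neurons as heavy hitters. All names (`hidden`, `ffn`, `neuron_scores`, `split_neurons`) and the sizes are hypothetical. Because the FFN output is a sum over neurons, the outputs of the two parts add back up to the output of the original FFN.

```python
import math
import random

def hidden(x, W1):
    """ReLU activation of every FFN neuron for one token x."""
    return [max(0.0, sum(w * a for w, a in zip(row, x))) for row in W1]

def ffn(x, W1, W2, neurons=None):
    """Bias-free FFN output using only the given neuron indices (all by default)."""
    idx = range(len(W1)) if neurons is None else neurons
    h = hidden(x, W1)
    out = [0.0] * len(W2[0])
    for i in idx:
        for j, w in enumerate(W2[i]):
            out[j] += h[i] * w
    return out

def neuron_scores(tokens, W1, W2):
    """Mean output norm |h_i(x)| * ||W2[i]|| of each neuron over the tokens.

    This scoring rule is an assumption made for illustration.
    """
    row_norms = [math.sqrt(sum(w * w for w in row)) for row in W2]
    scores = [0.0] * len(W1)
    for x in tokens:
        for i, h in enumerate(hidden(x, W1)):
            scores[i] += h * row_norms[i] / len(tokens)
    return scores

def split_neurons(scores, k):
    """Return (heavy, rest): the k highest-scoring neuron indices and the others."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])

# Toy weights and tokens; W1[i] and W2[i] are the input and output rows of neuron i.
rng = random.Random(0)
d_model, d_ff, k = 8, 32, 4
W1 = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
W2 = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(64)]

heavy, rest = split_neurons(neuron_scores(tokens, W1, W2), k)

# The two sub-FFNs together reproduce the full FFN up to rounding error.
x = tokens[0]
full = ffn(x, W1, W2)
recombined = [a + b for a, b in zip(ffn(x, W1, W2, heavy), ffn(x, W1, W2, rest))]
max_err = max(abs(a - b) for a, b in zip(full, recombined))
print("heavy-hitter neurons:", heavy)
print("max reconstruction error:", max_err)
```

Since the split itself changes nothing about the output, any accuracy loss would come only from how each part is compressed afterwards, for example by compressing the `rest` part more aggressively than the `heavy` part.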
