Scalable MatMul-free Language Modeling

Abstract

Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on par with state-of-the-art Transformers that require far more memory during inference, at scales up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full-precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13 W beyond human-readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm.
