TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices

Abstract

Small Language Models (SLMs, or on-device LMs) have significantly fewer parameters than Large Language Models (LLMs). They are typically deployed on low-end devices, such as mobile phones and single-board computers. Unlike LLMs, which rely on increasing model size for better generalisation, SLMs designed for edge applications are expected to adapt to their deployment environments and to be energy efficient given device battery life constraints; neither requirement is addressed in datacenter-deployed LLMs. This paper addresses these two requirements by proposing a training-free token embedding compression approach using Tensor-Train Decomposition (TTD). Each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We comprehensively evaluate the extracted low-rank structures across compression ratio, language task performance, latency, and energy consumption on a typical low-end device, i.e., a Raspberry Pi. Taking the sub-billion parameter versions of GPT-2/Cerebras-GPT and OPT models as examples, our approach achieves language task performance comparable to the original model with around $2.0\times$ embedding layer compression, while the energy consumption of a single query drops by half.
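
To make the central idea concrete, the following is a minimal sketch of how a single embedding vector could be reshaped into a higher-order tensor and factorised into tensor-train (MPS) cores with successive truncated SVDs. It is an illustration under stated assumptions, not the paper's implementation: the function names `tt_decompose` and `tt_reconstruct`, the reshaping `dims`, and the rank cap `max_rank` are all hypothetical choices made for this example.

```python
import numpy as np


def tt_decompose(vec, dims, max_rank):
    """Factorise a 1-D vector, reshaped to `dims`, into tensor-train cores.

    Hypothetical helper: `dims` and `max_rank` are illustrative settings,
    not values taken from the paper. Core k has shape
    (rank_in, dims[k], rank_out), with boundary ranks equal to 1.
    """
    vec = np.asarray(vec, dtype=np.float64)
    assert vec.size == int(np.prod(dims)), "dims must multiply to len(vec)"
    cores = []
    rank = 1
    rest = vec
    for n in dims[:-1]:
        # Unfold: rows index (incoming rank, current mode), columns the rest.
        rest = rest.reshape(rank * n, -1)
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        new_rank = min(max_rank, s.size)  # truncate to the rank cap
        cores.append(u[:, :new_rank].reshape(rank, n, new_rank))
        # Carry the singular values forward into the remaining factor.
        rest = s[:new_rank, None] * vt[:new_rank]
        rank = new_rank
    cores.append(rest.reshape(rank, dims[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Contract the cores back into a flat vector."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape(-1)


def num_params(cores):
    """Total number of stored values across all cores."""
    return sum(core.size for core in cores)
```

As a usage example, a 768-dimensional vector could be passed with `dims=(8, 8, 12)`; with `max_rank=2` the cores store far fewer values than the original vector, at the cost of an approximation error that depends on how much low-rank structure the vector has. Because each token is factorised independently and no gradient step is involved, a sketch like this is consistent with the training-free, per-token framing in the abstract.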
