Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

Abstract

We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through estimator to the training. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training from scratch, continued training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). In particular, the synergy of BitNet b1.58 and Q-Sparse (which can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency, including cost and energy consumption, of future LLMs.
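
To make the two ingredients named in the abstract concrete, the following is a minimal sketch of top-K sparsification with a straight-through estimator, written in plain Python on a single activation vector. It is an illustration under stated assumptions, not the authors' implementation: selecting entries by absolute magnitude, the tie-breaking rule, and the function names `top_k_mask`, `sparsify_forward`, and `sparsify_backward_ste` are all hypothetical choices, since the abstract does not specify them.

```python
from typing import List


def top_k_mask(x: List[float], k: int) -> List[bool]:
    """Return a mask that keeps the k entries of x with the largest magnitude.

    Assumption: "top-K" ranks by absolute value. Ties go to the lower index,
    because sorted() is stable.
    """
    if not 0 <= k <= len(x):
        raise ValueError("k must be between 0 and len(x)")
    order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
    keep = set(order[:k])
    return [i in keep for i in range(len(x))]


def sparsify_forward(x: List[float], k: int) -> List[float]:
    """Forward pass: zero every activation outside the top-k set."""
    mask = top_k_mask(x, k)
    return [xi if keep else 0.0 for xi, keep in zip(x, mask)]


def sparsify_backward_ste(grad_output: List[float]) -> List[float]:
    """Backward pass with a straight-through estimator.

    The sparsification is treated as the identity, so the incoming gradient
    is passed to every input unchanged, including the zeroed entries.
    """
    return list(grad_output)


activations = [0.5, -2.0, 0.1, 1.5]
print(sparsify_forward(activations, k=2))         # [0.0, -2.0, 0.0, 1.5]
print(sparsify_backward_ste([1.0, 1.0, 1.0, 1.0]))  # [1.0, 1.0, 1.0, 1.0]
```

In this sketch only half of the activations are nonzero after the forward pass, which is the source of the inference savings, while the straight-through backward pass still delivers a gradient to the entries that were masked out.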
