BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference

Abstract

Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 11.83% (7.56%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.46% (2.65%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
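The idea of picking a per-block number format from a formatbook can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the DialectFP4 definition: the names `FORMATBOOK`, `quantize_block`, and `pick_dialect` are hypothetical, the `dense_top` and `uniform` value sets are invented, and the per-block scale is a plain floating-point ratio rather than a hardware-friendly shared exponent. Only the `e2m1` entry reflects a known fact: the FP4 E2M1 magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6, written here as integers after multiplying by 2.

```python
import math

# Hypothetical formatbook: each "dialect" lists 8 non-negative magnitudes
# (3 magnitude bits + 1 sign bit = 4 bits), stored as small integers.
FORMATBOOK = {
    "e2m1": [0, 1, 2, 3, 4, 6, 8, 12],       # FP4 E2M1 magnitudes times 2
    "dense_top": [0, 1, 2, 3, 4, 6, 8, 10],  # invented for illustration
    "uniform": [0, 1, 2, 3, 4, 5, 6, 7],     # invented, evenly spaced
}


def quantize_block(block, levels):
    """Quantize one block to the given integer levels.

    Returns (dequantized_values, mean_squared_error).
    Assumption: the scale maps the block's largest magnitude onto the
    largest level; a real design would likely restrict the scale further.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0:
        return [0.0] * len(block), 0.0
    scale = max_abs / levels[-1]
    dequantized = []
    for x in block:
        # Nearest representable magnitude in integer units.
        mag = min(levels, key=lambda v: abs(abs(x) / scale - v))
        dequantized.append(math.copysign(mag * scale, x))
    mse = sum((a - b) ** 2 for a, b in zip(block, dequantized)) / len(block)
    return dequantized, mse


def pick_dialect(block):
    """Choose the formatbook entry with the lowest error for this block."""
    best = min(FORMATBOOK, key=lambda name: quantize_block(block, FORMATBOOK[name])[1])
    return best, quantize_block(block, FORMATBOOK[best])[0]


if __name__ == "__main__":
    for block in ([0, 1, 2, 3, 4, 5, 6, 7], [12, -6, 1, 0]):
        name, values = pick_dialect(block)
        print(block, "->", name, values)
```

This sketch searches the formatbook exhaustively and scores each candidate by mean squared error, which is the simplest way to express "per-block optimal format". It says nothing about the two-stage online selection used for activations, which the abstract mentions but does not detail.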
