BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference

  • 2025-01-03 09:27:46
  • Wonsuk Jang, Thierry Tambe

Abstract

Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 11.83% (7.56%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.46% (2.65%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
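
The per-block format selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the block size of 32, the real-valued max-abs scale (the paper may use a different scaling scheme), the mean-squared-error selection rule, and the formatbook itself. Only the first row is grounded in a known format: it is the standard FP4 E2M1 magnitude set (0, 0.5, 1, 1.5, 2, 3, 4, 6) written as integers in units of 0.5. The remaining rows are hypothetical and are not the DialectFP4 tables. The names `FORMATBOOK`, `quantize_block`, `mse`, and `block_dialect_quantize` are likewise invented here.

```python
import math

# Hypothetical formatbook: each "dialect" lists its representable magnitudes
# as small integers (units of 0.5). Row 0 is standard FP4 E2M1 times 2;
# the other rows are made up for illustration, NOT the paper's DialectFP4.
FORMATBOOK = [
    (0, 1, 2, 3, 4, 6, 8, 12),
    (0, 1, 2, 3, 4, 6, 8, 10),
    (0, 2, 4, 6, 8, 10, 12, 14),
    (0, 1, 2, 3, 4, 5, 6, 8),
]


def quantize_block(block, levels):
    """Round each value to the nearest signed level after max-abs scaling."""
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return [0.0] * len(block)
    # Assumption: a real-valued scale mapping the block peak to the top level.
    scale = peak / levels[-1]
    out = []
    for v in block:
        nearest = min(levels, key=lambda q: abs(abs(v) / scale - q))
        out.append(math.copysign(nearest * scale, v))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def block_dialect_quantize(values, block_size=32, formatbook=FORMATBOOK):
    """Pick, per block, the dialect with the lowest MSE (assumed criterion)."""
    if len(values) % block_size:
        raise ValueError("length must be a multiple of block_size")
    quantized, chosen = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        candidates = [quantize_block(block, levels) for levels in formatbook]
        errors = [mse(block, c) for c in candidates]
        best = errors.index(min(errors))
        chosen.append(best)  # per-block dialect index stored with the block
        quantized.extend(candidates[best])
    return quantized, chosen
```

Calling `block_dialect_quantize(x)` on a flat list returns the dequantized values and one dialect index per block. Because the single-format case is one of the candidates, the per-block error in this sketch can never exceed that of using the E2M1 row alone, which is the intuition behind choosing how to represent each block rather than only how to scale it.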

 
