Abstract
As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 inputs. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats such as INT4. Experimental results show that INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data formats.
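To make the idea of token-level INT8 quantization concrete, the following is a minimal sketch in plain Python rather than the authors' GPU kernel. It assumes symmetric quantization with one scale per token (row) and applies it to the query-key score computation only; the function names `quantize_per_token` and `int8_scores` are hypothetical, and the exact scaling scheme used by INT-FlashAttention may differ.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_token(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per token (row).

    Assumption: scale = max|x| / 127 per row; this is an illustrative
    choice, not necessarily the scheme used in the paper.
    """
    q_rows, scales = [], []
    for row in x:
        max_abs = max(abs(v) for v in row)
        # Guard against all-zero rows; a scale of 1.0 is an arbitrary fallback.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_scores(q: Matrix, k: Matrix) -> Matrix:
    """Approximate Q @ K^T using integer dot products plus per-token scales."""
    q_int, q_scale = quantize_per_token(q)
    k_int, k_scale = quantize_per_token(k)
    scores = []
    for i, q_row in enumerate(q_int):
        out_row = []
        for j, k_row in enumerate(k_int):
            # Integer multiply-accumulate, standing in for an INT8 GEMM.
            acc = sum(a * b for a, b in zip(q_row, k_row))
            # Dequantize with the product of the two token-level scales.
            out_row.append(acc * q_scale[i] * k_scale[j])
        scores.append(out_row)
    return scores


q = [[1.0, -2.0], [0.5, 0.25]]
k = [[2.0, 1.0], [-1.0, 3.0]]
approx = int8_scores(q, k)
```

Because each token carries its own scale, a token with unusually large activations does not degrade the resolution available to the other tokens, which is the usual motivation for token-level over tensor-level scaling.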