Abstract
Large Language Models (LLMs) have pushed the frontier of artificial intelligence, but they comprise hundreds of billions of parameters and operations. To reduce inference latency, LLMs are deployed across multiple hardware accelerators using various Model Parallelism strategies. This paper examines one such strategy, Tensor Parallel, and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. The proposed method yields up to a 2x reduction in time-to-first-token (TTFT) with negligible degradation in model performance.
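As a rough illustration of what fine-grained quantization of activations can look like, the sketch below quantizes a list of values block by block, with one shared scale per block. It is not the paper's method: the 4-bit codes, block size of 32, 16-bit scale, 16-bit source format, and all function names are assumptions chosen for illustration only.

```python
def quantize_block(values, bits=4):
    """Symmetric quantization of one block to signed integer codes plus a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # largest code magnitude, e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize(values, block_size=32, bits=4):
    """Split the values into fixed-size blocks and quantize each independently."""
    return [
        quantize_block(values[start:start + block_size], bits)
        for start in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate values from (codes, scale) pairs."""
    out = []
    for codes, scale in blocks:
        out.extend(c * scale for c in codes)
    return out


def compression_ratio(block_size=32, bits=4, source_bits=16, scale_bits=16):
    """Original size divided by compressed size, counting one scale per block."""
    compressed_bits_per_value = bits + scale_bits / block_size
    return source_bits / compressed_bits_per_value


if __name__ == "__main__":
    activations = [0.1 * i - 1.6 for i in range(64)]  # hypothetical activation values
    restored = dequantize(quantize(activations))
    worst = max(abs(a - b) for a, b in zip(activations, restored))
    print(f"max abs error: {worst:.4f}")
    print(f"compression ratio: {compression_ratio():.2f}x")
```

Under these assumed settings each value costs 4 bits plus a 1/32 share of a 16-bit scale, which is why smaller blocks improve accuracy but lower the achievable compression ratio.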