Abstract
Large language models (LLMs) have demonstrated exceptional capabilities in generating text, images, and video content. However, as context length grows, the computational cost of attention increases quadratically with the number of tokens, presenting significant efficiency challenges. This paper presents an analysis of various Key-Value (KV) cache compression strategies, offering a comprehensive taxonomy that categorizes these methods by their underlying principles and implementation techniques. Furthermore, we evaluate their impact on performance and inference latency, providing critical insights into their effectiveness. Our findings highlight the trade-offs involved in KV cache compression and its influence on handling long-context scenarios, paving the way for more efficient LLM implementations.