Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Abstract

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
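
The project-fuse-gate idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal NumPy illustration under loud simplifying assumptions: both models have the same number of layers and aligned token positions, a single tensor stands in for either the key or the value cache, the projection and fusion are plain linear maps with random (untrained) weights, and the class name `ToyCacheFuser` and all parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ToyCacheFuser:
    """Hypothetical per-layer fuser (illustration only, not the C2C code)."""

    def __init__(self, num_layers, d_src, d_tgt, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed: one linear projection per layer, source width -> target width.
        self.w_proj = rng.normal(0.0, 0.02, size=(num_layers, d_src, d_tgt))
        # Assumed: one linear fusion map per layer over [target ; projected source].
        self.w_fuse = rng.normal(0.0, 0.02, size=(num_layers, 2 * d_tgt, d_tgt))
        # Assumed: one gate logit per target layer; sigmoid(logit) scales the update.
        self.gate_logit = np.zeros(num_layers)

    def fuse(self, tgt_cache, src_cache):
        # tgt_cache: (layers, tokens, d_tgt); src_cache: (layers, tokens, d_src)
        projected = np.einsum("lts,lsd->ltd", src_cache, self.w_proj)
        concat = np.concatenate([tgt_cache, projected], axis=-1)
        update = np.einsum("ltc,lcd->ltd", concat, self.w_fuse)
        gate = sigmoid(self.gate_logit)[:, None, None]
        # Residual update: a gate near 0 leaves that layer's target cache unchanged.
        return tgt_cache + gate * update


fuser = ToyCacheFuser(num_layers=2, d_src=6, d_tgt=4)
tgt = np.ones((2, 3, 4))
src = np.ones((2, 3, 6))
fused = fuser.fuse(tgt, src)
print(fused.shape)
```

The fused cache has the same shape as the target cache, which mirrors the abstract's point that cache semantics are enriched without increasing cache size. In a trained system the weights and gate logits would be learned; here they are random placeholders.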
