Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models

Abstract

Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.
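
The caching idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the bag-of-words `embed` function, the cosine-similarity lookup, the `threshold` value, and the `summarize` callback are all assumptions chosen to keep the example self-contained. A real system would likely use a learned sentence encoder and an LLM-generated summary in their place.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (hypothetical stand-in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticSummaryCache:
    """Stores summaries keyed by query embedding; reuses them for similar queries."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cutoff, not from the paper
        self.entries = []           # list of (embedding, summary) pairs
        self.hits = 0
        self.misses = 0

    def lookup(self, query):
        """Return the most similar cached summary, or None if below threshold."""
        q = embed(query)
        best_score, best_summary = 0.0, None
        for emb, summary in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_summary = score, summary
        if best_summary is not None and best_score >= self.threshold:
            return best_summary
        return None

    def get_or_compute(self, query, summarize):
        """Reuse a cached summary on a hit; otherwise call summarize and store it."""
        cached = self.lookup(query)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        summary = summarize(query)  # the expensive step the cache tries to avoid
        self.entries.append((embed(query), summary))
        return summary
```

In this sketch, calling `get_or_compute` with two closely worded questions invokes `summarize` only once, since the second query falls within the similarity threshold of the first. The threshold controls the trade-off the abstract describes: a lower value reuses summaries more aggressively at some risk to answer quality.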
