Characterizing Prompt Compression Methods for Long Context Inference

Abstract

Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compress the prompt to reduce the context length. However, there has been little work on comparing the different proposed methods across different tasks through a standardized analysis. This has led to conflicting results. To address this, here we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all the other approaches, and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that despite several recent claims, token pruning methods often lag behind extractive compression. We only found marginal improvements on summarization tasks.
