Abstract
Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, a framework that accelerates inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: as information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity even when a large number of tokens, including these critical ones, are pruned from the hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens from the hidden states at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.
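
To make the idea of pruning hidden states at an intermediate layer concrete, the following is a minimal sketch and not the SlimInfer implementation. It assumes that a per-token importance score is already available at the pruning layer; the function name `prune_hidden_states`, the mean-score block ranking, and the `block_size` and `keep_blocks` parameters are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def prune_hidden_states(
    hidden: Sequence[Sequence[float]],
    scores: Sequence[float],
    block_size: int,
    keep_blocks: int,
) -> Tuple[List[int], List[List[float]]]:
    """Keep the `keep_blocks` highest-scoring token blocks (hypothetical rule).

    hidden: one hidden-state vector per prompt token.
    scores: one importance score per token (assumed to be given).
    Returns the kept token indices (in original order) and their hidden states.
    """
    if len(hidden) != len(scores):
        raise ValueError("hidden and scores must have the same length")
    if block_size <= 0 or keep_blocks < 0:
        raise ValueError("block_size must be positive and keep_blocks non-negative")

    n = len(scores)
    # Split token positions into contiguous blocks of `block_size` tokens.
    blocks = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]
    # Rank blocks by the mean importance of their tokens (illustrative choice).
    block_scores = [sum(scores[i] for i in b) / len(b) for b in blocks]
    top = sorted(range(len(blocks)), key=lambda j: block_scores[j], reverse=True)
    top = top[:keep_blocks]
    # Restore the original token order so positions stay monotonic.
    kept = sorted(i for j in top for i in blocks[j])
    return kept, [list(hidden[i]) for i in kept]


if __name__ == "__main__":
    toy_hidden = [[float(i)] for i in range(8)]
    toy_scores = [0.9, 0.8, 0.1, 0.1, 0.2, 0.1, 0.7, 0.6]
    kept, pruned = prune_hidden_states(toy_hidden, toy_scores, block_size=2, keep_blocks=2)
    print(kept)
```

In this sketch, later layers would operate only on the returned hidden states, and the kept block indices are exactly the blocks whose KV entries those layers need, which is why layer-wise pruning lends itself to prefetching without a separate predictor.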