Better Prompt Compression Without Multi-Layer Perceptrons

Abstract

Prompt compression is a promising approach to speeding up language model inference without altering the generative model. Prior works compress prompts into smaller sequences of learned tokens using an encoder that is trained as a Low-Rank Adaptation (LoRA) of the inference language model. However, we show that the encoder does not need to keep the original language model's architecture to achieve useful compression. We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multilayer perceptron (MLP) layers in the Transformer blocks of a language model, resulting in an encoder with roughly 67% fewer parameters than the original model. Intriguingly, we find that, across a range of compression ratios up to 480x, AOC can better regenerate prompts and outperform a baseline compression encoder that is a LoRA of the inference language model without removing MLP layers. These results demonstrate that the architecture of prompt compression encoders does not need to be identical to that of the original decoder language model, paving the way for further research into architectures and approaches for prompt compression.
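
To give a feel for where a figure near 67% could come from, the following is a minimal back-of-the-envelope sketch, not the authors' accounting. It assumes a classic Transformer block with full multi-head attention (four d_model x d_model projections) and a two-matrix MLP with a 4x hidden expansion, and it ignores embeddings, biases, and normalization weights. The function names and the default expansion factor are illustrative assumptions; models with gated MLPs or grouped-query attention will give somewhat different ratios.

```python
def block_param_counts(d_model: int, mlp_expansion: int = 4) -> dict:
    """Count weight-matrix parameters in one Transformer block.

    Assumption: biases, layer norms, and embeddings are ignored, and the
    MLP is a plain up-projection followed by a down-projection.
    """
    # Query, key, value, and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # Up-projection (d_model -> hidden) and down-projection (hidden -> d_model).
    hidden = mlp_expansion * d_model
    mlp = 2 * d_model * hidden
    return {"attention": attention, "mlp": mlp, "total": attention + mlp}


def fraction_removed(d_model: int, mlp_expansion: int = 4) -> float:
    """Fraction of block parameters dropped when the MLP is removed."""
    counts = block_param_counts(d_model, mlp_expansion)
    return counts["mlp"] / counts["total"]


if __name__ == "__main__":
    # Hypothetical hidden size, chosen only for illustration.
    print(f"{fraction_removed(4096):.3f}")
```

Under these assumptions the MLP holds 8 of every 12 weight-matrix parameters in a block, so dropping it removes about two thirds of them regardless of d_model, which is in line with the "roughly 67%" stated in the abstract.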
