Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation

Abstract

Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive for deployment in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, *superposition prompting*, which can be directly applied to pre-trained transformer-based LLMs *without the need for fine-tuning*. At a high level, superposition prompting allows the LLM to process input documents in parallel *prompt paths*, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to improve time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative to the context the model was trained on. For example, our approach facilitates a 93x reduction in compute time while *improving* accuracy by 43% on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.
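To make the quadratic-cost argument concrete, the following is a back-of-envelope sketch, not the method's actual implementation. It assumes a toy cost model in which attention work is simply the square of the number of tokens in a sequence, and it compares one long concatenated prompt against independent per-document paths. The function names, the token counts, and the relevance scores passed to `prune_paths` are all hypothetical; the abstract does not specify how path relevance is judged.

```python
def attention_cost(num_tokens: int) -> int:
    """Toy cost model (assumption): work grows with the square of sequence length."""
    return num_tokens ** 2


def concatenated_cost(preamble: int, docs: list[int], query: int) -> int:
    """Naive RAG layout: preamble, all documents, and query in one sequence."""
    return attention_cost(preamble + sum(docs) + query)


def parallel_paths_cost(preamble: int, docs: list[int], query: int) -> int:
    """One independent path per document: preamble + that document + query."""
    return sum(attention_cost(preamble + d + query) for d in docs)


def prune_paths(scores: list[float], keep: int) -> list[int]:
    """Return indices of the `keep` highest-scoring paths, in original order.

    The scores are placeholders for whatever relevance signal is used.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    docs = [100] * 20  # 20 retrieved documents of 100 tokens each (made-up sizes)
    print(concatenated_cost(10, docs, 10))    # 2020**2 = 4080400
    print(parallel_paths_cost(10, docs, 10))  # 20 * 120**2 = 288000
    print(prune_paths([0.1, 0.9, 0.5], keep=2))  # [1, 2]
```

Under these made-up sizes the per-path layout does roughly 14 times less work in the toy model. The sketch ignores caching, batching, and every other factor that would affect real compute time, so it should not be read as an explanation of the 93x figure reported above.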
