Understanding Synthetic Context Extension via Retrieval Heads

Abstract

Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically generated long-context data in a post-training stage. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context: retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data are mostly subsets of the retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are necessary and explain model performance, although they are not totally sufficient. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
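
To make the notion of "recall of heads learned" concrete, the following is a minimal sketch under one assumed reading: recall is the fraction of retrieval heads found after real-data training that also appear after synthetic-data training. The function name `head_recall`, the `(layer, head)` representation, and the example head sets are hypothetical illustrations, not the paper's implementation or results.

```python
from typing import Set, Tuple

# Assumption: a head is identified by (layer index, head index).
Head = Tuple[int, int]


def head_recall(synthetic_heads: Set[Head], real_heads: Set[Head]) -> float:
    """Fraction of real-data retrieval heads also present in the synthetic set."""
    if not real_heads:
        raise ValueError("real_heads must be non-empty")
    return len(synthetic_heads & real_heads) / len(real_heads)


# Made-up head sets, for illustration only.
real = {(10, 3), (12, 7), (15, 1), (20, 4)}
synthetic = {(10, 3), (15, 1)}

recall = head_recall(synthetic, real)  # share of real-data heads recovered
is_subset = synthetic <= real          # True when synthetic heads are a subset
```

Under this reading, a synthetic dataset whose heads form a subset of the real-data heads can still have low recall, which is the quantity the abstract relates to downstream performance.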
