LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Abstract

Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling them to generate accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
