Attribute or Abstain: Large Language Models as Long Document Assistants

Abstract

LLMs can help humans working with long documents, but are known to hallucinate. Attribution can increase trust in LLM responses: the LLM provides evidence that supports its response, which enhances verifiability. Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance. This is crucially different from the long document setting, where retrieval is not needed, but could help. Thus, a long document specific evaluation of attribution is missing. To fill this gap, we present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiments with different approaches to attribution on 5 LLMs of different sizes. We find that citation, i.e., response generation and evidence extraction in one step, performs best for large and fine-tuned models, while additional retrieval can help for small, prompted models. We investigate whether the "Lost in the Middle" phenomenon exists for attribution, but do not find this. We also find that evidence quality can predict response quality on datasets with simple responses, but not so for complex responses, as models struggle with providing evidence for complex claims.
