ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Abstract

Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their ability to support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], which is then used to query a citation database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to improve efficiency. Our model is built upon Qwen-2.5-7B and trained on 500K papers from arXiv. It achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), significantly surpassing all existing models, including much larger ones like the Retrieval-Augmented Qwen2.5-72B-Instruct. Human studies further demonstrate that ScholarCopilot, despite being a 7B model, significantly outperforms ChatGPT, achieving 100% preference in citation quality and over 70% in overall usefulness.
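The retrieve-while-generating loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scripted `toy_model`, the `EOS` marker, the dot-product scoring, the `[cite:...]` marker format, and the database keys `paper_a` and `paper_b` are all hypothetical stand-ins. The abstract states only that a [RET] token triggers a query against a citation database and that the retrieved reference is fed back into the model.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

RET = "[RET]"  # retrieval token named in the abstract
EOS = "<eos>"  # hypothetical end-of-sequence marker

Vector = Sequence[float]
# Hypothetical model interface: given the context so far, return the next
# token and, when that token is [RET], a query vector for the lookup.
NextToken = Callable[[List[str]], Tuple[str, Optional[Vector]]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_top1(query: Vector, database: Dict[str, Vector]) -> str:
    """Return the key of the entry with the highest dot-product score.

    Dot-product similarity is an assumption made for this sketch.
    """
    return max(database, key=lambda key: dot(query, database[key]))


def generate(next_token: NextToken, database: Dict[str, Vector],
             max_steps: int = 50) -> List[str]:
    context: List[str] = []
    for _ in range(max_steps):
        token, query = next_token(context)
        if token == EOS:
            break
        if token == RET and query is not None:
            # Put the retrieved reference into the context in place of the
            # retrieval token, so later steps can condition on it.
            context.append(f"[cite:{retrieve_top1(query, database)}]")
        else:
            context.append(token)
    return context


# Hypothetical two-entry citation database with 2-d embeddings.
DATABASE: Dict[str, Vector] = {"paper_a": (1.0, 0.0), "paper_b": (0.0, 1.0)}


def toy_model(context: List[str]) -> Tuple[str, Optional[Vector]]:
    """Scripted stand-in for a language model; emits one [RET] token."""
    script = [("RAG", None), ("helps", None), (RET, (0.9, 0.1)), (EOS, None)]
    return script[len(context)]


if __name__ == "__main__":
    print(generate(toy_model, DATABASE))
```

In this sketch the model itself decides when to cite by emitting [RET], and the retrieved entry becomes part of the context for every later step. A real system would use learned representations for both the query and the database entries rather than the hand-written vectors shown here.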
