Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to improve factuality in large language models (LLMs) by grounding their outputs in retrieved documents. However, ensuring perfect retrieval of relevant information remains challenging, and when irrelevant content is passed downstream to an LLM, it can lead to hallucinations. In this work, we propose Finetune-RAG, a simple and effective fine-tuning approach that features the first-of-its-kind RAG training dataset constructed to mimic real-world imperfections. Experimental results show that Finetune-RAG improves factual accuracy by 21.2% over the base model. We also propose Bench-RAG, an LLM-as-a-judge evaluation pipeline that stress-tests models under realistic imperfect retrieval scenarios. Our codebase and dataset are fully open-sourced for community use.