Lessons from Training Grounded LLMs with Verifiable Rewards

Abstract

Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
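
To make the training signal concrete, the following is a minimal sketch of how outcome-based rewards could feed GRPO's group-relative advantage. It is an illustration under stated assumptions, not the paper's implementation: the function names, the weighted-sum combination of the three reward components, the unit weights, and the use of the population standard deviation are all hypothetical choices. The only part taken from GRPO itself is the idea of normalizing each sampled response's reward against the other responses drawn for the same prompt.

```python
from statistics import mean, pstdev


def combined_reward(answer_score, citation_score, refusal_score,
                    weights=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: a weighted sum of three verifiable scores.

    The abstract names the three targets (answer correctness, citation
    sufficiency, refusal quality); the weighted-sum form and the default
    weights are assumptions made for this sketch.
    """
    w_ans, w_cit, w_ref = weights
    return w_ans * answer_score + w_cit * citation_score + w_ref * refusal_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    Population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, each with hypothetical component scores.
group = [
    combined_reward(1.0, 1.0, 1.0),  # correct, well cited, appropriate refusal behavior
    combined_reward(1.0, 0.0, 1.0),  # correct but poorly cited
    combined_reward(0.0, 0.0, 1.0),  # wrong answer
    combined_reward(0.0, 0.0, 0.0),  # wrong answer and wrong refusal decision
]
advantages = group_relative_advantages(group)
```

Because the advantages are centered on the group mean, a group in which every response receives the same reward contributes no gradient signal. That property is one plausible reason a staged reward schedule could help stabilize learning, though the abstract does not state the mechanism.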
