UR²: Unify RAG and Reasoning through Reinforcement Learning

Abstract

Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR² (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR² introduces two key contributions: a difficulty-aware curriculum training scheme that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy that combines domain-specific offline corpora with LLM-generated summaries. These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR² (built on Qwen2.5-3B/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at https://github.com/Tsinghua-dhy/UR2.
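The idea of invoking retrieval only for challenging problems can be illustrated with a minimal sketch. It assumes, hypothetically, that difficulty is estimated as the pass rate of k retrieval-free samples scored by an exact-match verifiable reward, and that problems falling below a threshold are routed to retrieval-enabled rollouts. The names `Problem`, `pass_rate`, `split_by_difficulty`, `k`, and `hard_threshold` are illustrative only; this is not the released UR² implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    question: str
    answer: str


def pass_rate(problem: Problem, sample: Callable[[str], str], k: int = 8) -> float:
    """Fraction of k retrieval-free samples whose output exactly matches the
    reference answer (a stand-in for a verifiable reward)."""
    correct = sum(sample(problem.question).strip() == problem.answer for _ in range(k))
    return correct / k


def split_by_difficulty(
    problems: List[Problem],
    sample: Callable[[str], str],
    k: int = 8,
    hard_threshold: float = 0.5,
) -> Tuple[List[Problem], List[Problem]]:
    """Hypothetical routing: problems the model already solves reliably stay
    retrieval-free; the rest are marked for retrieval-augmented rollouts."""
    easy: List[Problem] = []
    hard: List[Problem] = []
    for p in problems:
        if pass_rate(p, sample, k) < hard_threshold:
            hard.append(p)
        else:
            easy.append(p)
    return easy, hard


# Toy deterministic "model" used only to exercise the routing logic.
def toy_sampler(question: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "unknown")


problems = [
    Problem("2+2", "4"),
    Problem("capital of France", "Paris"),
    Problem("obscure fact", "needs lookup"),
]
easy, hard = split_by_difficulty(problems, toy_sampler)
print([p.question for p in easy], [p.question for p in hard])
```

With a real policy model, `sample` would be stochastic, so the pass rate becomes a graded difficulty signal rather than the all-or-nothing values the toy sampler produces.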
