SPICE: Self-Play In Corpus Environments Improves Reasoning

Abstract

Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
