VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, code, and models to facilitate future research at https://github.com/THU-KEG/VerIF.
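
To make the hybrid verification idea concrete, the following is a minimal sketch and not the released implementation. It assumes that each instruction carries constraints of two kinds: ones that can be checked by code and ones that need a model's judgment. The helper names (`max_words`, `contains_keyword`, `hybrid_reward`, `stub_judge`) are hypothetical, and the aggregation (fraction of constraints satisfied) is an assumption, since the abstract does not state how the two signals are combined.

```python
from typing import Callable, List

# Hypothetical rule-based checks for constraints that code can verify.
def max_words(limit: int) -> Callable[[str], bool]:
    return lambda response: len(response.split()) <= limit

def contains_keyword(keyword: str) -> Callable[[str], bool]:
    return lambda response: keyword.lower() in response.lower()

def hybrid_reward(
    response: str,
    rule_checks: List[Callable[[str], bool]],
    soft_constraints: List[str],
    llm_judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of constraints satisfied across both verifiers.

    The aggregation is an assumption made for illustration only.
    """
    # Rule-based code verification.
    results = [check(response) for check in rule_checks]
    # LLM-based verification, one judgment per natural-language constraint.
    results += [llm_judge(response, c) for c in soft_constraints]
    return sum(results) / len(results) if results else 0.0

# Stand-in judge so the sketch runs offline; a real setup would query a
# large reasoning model here instead of using this toy heuristic.
def stub_judge(response: str, constraint: str) -> bool:
    return not response.isupper()

reward = hybrid_reward(
    "The cat sat quietly.",
    [max_words(5), contains_keyword("cat")],
    ["Use a calm tone."],
    stub_judge,
)
```

Passing the judge in as a callable keeps the rule-based path testable without model access; the resulting scalar could then serve as the reward signal in an RL loop.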
