Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Abstract

Existing large language models (LLMs) struggle to follow complex instructions, especially when multiple constraints are present and organized in parallel, chained, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the composition of constraints to identify their relationships across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions by incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Code and data will be available later (under review).

Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
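To make the idea of a verifiable, rule-centric reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes each constraint in an instruction can be checked by a simple programmatic rule and that the reward is the fraction of rules satisfied. The names `Constraint`, `max_words`, `must_include`, `ends_with`, and `rule_reward` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A single checkable rule (hypothetical structure for illustration)."""
    name: str
    check: Callable[[str], bool]


def max_words(n: int) -> Constraint:
    # Satisfied when the response has at most n whitespace-separated words.
    return Constraint(f"max_words<={n}", lambda text: len(text.split()) <= n)


def must_include(keyword: str) -> Constraint:
    # Satisfied when the keyword appears anywhere, ignoring case.
    return Constraint(
        f"include:{keyword}", lambda text: keyword.lower() in text.lower()
    )


def ends_with(suffix: str) -> Constraint:
    # Satisfied when the response, ignoring trailing whitespace, ends with suffix.
    return Constraint(
        f"ends_with:{suffix}", lambda text: text.rstrip().endswith(suffix)
    )


def rule_reward(response: str, constraints: List[Constraint]) -> float:
    """Return the fraction of constraints the response satisfies (assumed scoring)."""
    if not constraints:
        return 0.0
    passed = sum(1 for c in constraints if c.check(response))
    return passed / len(constraints)


# Three constraints composed in parallel: all apply to the same response.
constraints = [max_words(10), must_include("apple"), ends_with(".")]

good = "I ate an apple today."
bad = "Bananas are great and I eat them every single day without fail"

print(rule_reward(good, constraints))
print(rule_reward(bad, constraints))
```

Because every check is deterministic, a reward of this kind can be recomputed by anyone with the same rules, which is what makes it verifiable. Chained or branching constraints would need rules that depend on one another and are not covered by this sketch.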
