Abstract
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs correctly follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
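To make the notion of a verifiable instruction concrete, the following is a minimal illustrative sketch, not the LogicIFGen implementation. It assumes that a code function serves as the ground truth, that a natural-language instruction describes the same logic, and that a model's answer is checked against the function's executed output. The names `reference_function`, `INSTRUCTION`, `verify`, and `accuracy` are hypothetical and do not come from the released code.

```python
def reference_function(values):
    """Hypothetical seed function: a conditional nested inside a loop."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v   # even numbers are added to the total
        else:
            total -= 1   # odd numbers subtract 1 from the total
    return total


# Hypothetical natural-language rendering of the function above.
INSTRUCTION = (
    "Start with a total of 0. Go through the numbers in order. "
    "If a number is even, add it to the total; otherwise subtract 1 "
    "from the total. Report the final total."
)


def verify(model_answer, values):
    """Check a model's answer against the executed reference output."""
    return model_answer == reference_function(values)


def accuracy(answers, inputs):
    """Fraction of inputs for which the model's answer was verified."""
    correct = sum(verify(a, x) for a, x in zip(answers, inputs))
    return correct / len(inputs)
```

Under these assumptions, a model that is given `INSTRUCTION` together with the input `[2, 3, 4]` counts as following the instruction only if it reports the same value that `reference_function([2, 3, 4])` returns when executed.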