Abstract
As language agents increasingly automate critical tasks, their ability to follow domain-specific standard operating procedures (SOPs), policies, and constraints when taking actions and making tool calls becomes essential, yet it remains underexplored. To address this gap, we develop SOPBench, an automated evaluation pipeline with: (1) executable environments containing 167 tools/functions across seven customer service domains, with service-specific SOPs and rule-based verifiers, (2) an automated test generation framework producing over 900 verified test cases, and (3) an automated evaluation framework that rigorously assesses agent adherence along multiple dimensions. Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. The original code serves as an oracle rule-based verifier to assess compliance, reducing reliance on manual annotations and LLM-based evaluations. We evaluate 18 leading models, and the results show that the task is challenging even for top-tier models (such as GPT-4o and Claude-3.7-Sonnet), with performance varying across domains. Reasoning models such as o4-mini-high perform best, while other powerful models perform less effectively (pass rates of 30%-50%), and small models (7B, 8B) perform significantly worse. Additionally, language agents can be easily jailbroken into overlooking SOPs and constraints. Code, data, and over 24k agent trajectories are released at https://github.com/Leezekun/SOPBench.
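As a rough illustration of the idea of an SOP as a directed graph of executable functions with a rule-based verifier, the following minimal sketch assumes a hypothetical bank-transfer SOP. The tool names (`transfer_funds`, `check_balance`, `verify_identity`) and the helpers `required_calls` and `complies` are invented for this example and are not taken from the SOPBench codebase. The check is deliberately simplified: it only tests whether every prerequisite call appears before the target action in an agent trajectory.

```python
# Hypothetical SOP graph: each tool maps to the tools that must be called first.
# All names here are illustrative assumptions, not SOPBench identifiers.
SOP_GRAPH: dict[str, list[str]] = {
    "transfer_funds": ["verify_identity", "check_balance"],
    "check_balance": ["verify_identity"],
    "verify_identity": [],
}


def required_calls(action: str, graph: dict[str, list[str]]) -> list[str]:
    """Return all direct and transitive prerequisites of `action`, dependencies first."""
    order: list[str] = []

    def visit(node: str) -> None:
        # Depth-first post-order: record a node only after its prerequisites.
        for dep in graph[node]:
            visit(dep)
        if node not in order:
            order.append(node)

    visit(action)
    return order[:-1]  # the last entry is the action itself


def complies(trajectory: list[str], action: str, graph: dict[str, list[str]]) -> bool:
    """Simplified rule-based check: every prerequisite precedes the first call to `action`."""
    if action not in trajectory:
        return False
    called_before = set(trajectory[: trajectory.index(action)])
    return all(dep in called_before for dep in required_calls(action, graph))
```

Under these assumptions, a trajectory that calls `verify_identity`, then `check_balance`, then `transfer_funds` passes the check, whereas one that jumps straight to `transfer_funds` fails it. A fuller verifier would also need to decide, from the environment state, whether the action should be permitted at all.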