Abstract
With the rise of artificial intelligence (AI), applying large language models (LLMs) to Operations Research (OR) problem-solving has attracted increasing attention. Most existing approaches attempt to improve OR problem-solving through prompt engineering or fine-tuning strategies for LLMs. However, these methods are fundamentally constrained by the limited capabilities of non-reasoning LLMs. To overcome these limitations, we propose OR-LLM-Agent, an AI agent built on reasoning LLMs for automated OR problem-solving. The agent decomposes the task into three sequential stages: mathematical modeling, code generation, and debugging. Each stage is handled by a dedicated sub-agent, which enables more targeted reasoning. We also construct BWOR, a high-quality dataset for evaluating LLM performance on OR tasks. Our analysis shows that existing benchmarks such as NL4OPT, MAMO, and IndustryOR suffer from certain issues, making them less suitable for reliably evaluating LLM performance. In contrast, BWOR provides a more consistent and discriminative assessment of model capabilities. Experimental results demonstrate that OR-LLM-Agent outperforms advanced methods, including GPT-o3, Gemini 2.5 Pro, and ORLM, by at least 7% in accuracy. These results demonstrate the effectiveness of task decomposition for OR problem-solving.