Abstract
Large language models (LLMs) are increasingly used for tasks that require complex reasoning. Most benchmarks focus on final outcomes but overlook the intermediate reasoning steps, such as planning, revision, and decision making under resource constraints. We argue that measuring these internal processes is essential for understanding model behavior and improving reliability. We propose using strategic games as a natural evaluation environment: closed, rule-based systems with clear states, limited resources, and automatic feedback. We introduce a framework that evaluates LLMs along three core dimensions: planning, revision, and resource-constrained decision making. To operationalize this, we define metrics beyond win rate, including overcorrection risk rate, correction success rate, improvement slope, and over-budget ratio. In 4320 adversarial rounds across 12 leading models, ChatGPT-o3-mini achieves the top composite score, with a win rate of 74.7 percent, a correction success rate of 78.6 percent, and an improvement slope of 0.041. By contrast, Qwen-Plus, despite an overcorrection risk rate of 81.6 percent, wins only 25.6 percent of its matches, primarily due to excessive resource use. We also observe a negative correlation between overcorrection risk rate and correction success rate (Pearson r = -0.51, p = 0.093), suggesting that more frequent edits do not always improve outcomes. Our findings highlight the value of assessing not only what LLMs decide but how they arrive at those decisions.
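As a rough sanity check on the reported correlation, the sketch below converts a Pearson r into a two-sided p-value using the standard t-test for a correlation coefficient. It assumes the correlation was computed with one data point per model (n = 12), which the abstract does not state explicitly. The function names are hypothetical, and the p-value is obtained by simple numerical integration of the Student t density rather than by any particular statistics library.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation r against zero, df = n - 2."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def student_t_pdf(x: float, df: int) -> float:
    """Density of the Student t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1.0 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = student_t_pdf(0.0, df) + student_t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * student_t_pdf(i * h, df)
    central_half = total * h / 3.0  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_half)


# Assumption: n = 12, i.e. one (overcorrection rate, success rate) pair per model.
r, n = -0.51, 12
t = t_statistic(r, n)
p = two_sided_p(t, n - 2)
print(f"t = {t:.3f}, two-sided p = {p:.3f}")
```

Under that assumption the result lands close to the reported p = 0.093. The small gap is what one would expect from r being rounded to two decimals. The value is above the conventional 0.05 threshold, in line with the abstract's cautious wording.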