Abstract
Large language models (LLMs) often benefit from verbalized reasoning atinference time, but it remains unclear which aspects of task difficulty theseextra reasoning tokens address. To investigate this question, we formalize aframework using deterministic finite automata (DFAs). DFAs offer a formalismthrough which we can characterize task complexity through measurable propertiessuch as run length (number of reasoning steps required) and state-space size(decision complexity). We first show that across different tasks and models ofdifferent sizes and training paradigms, there exists an optimal amount ofreasoning tokens such that the probability of producing a correct solution ismaximized. We then investigate which properties of complexity govern thiscritical length: we find that task instances with longer correspondingunderlying DFA runs (i.e. demand greater latent state-tracking requirements)correlate with longer reasoning lengths, but, surprisingly, that DFA size (i.e.state-space complexity) does not. We then demonstrate an implication of thesefindings: being able to predict the optimal number of reasoning tokens for newproblems and filtering out non-optimal length answers results in consistentaccuracy improvements.