FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Abstract

Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment metric that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Evaluating 26 leading LLMs reveals that even frontier proprietary systems exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
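
To make the idea of a parameterized symbolic template with an executable trace concrete, the following is a minimal sketch and not FinChain's actual code. The function name `compound_interest_template`, the parameter ranges, the question wording, and the step labels are all hypothetical choices made for illustration. The sketch samples parameters from a seed, renders a question, and records every intermediate value so that a model's reasoning steps could in principle be checked against them.

```python
import random


def compound_interest_template(seed):
    """Hypothetical template: sample parameters, render a question,
    and compute a step-by-step trace ending in the final answer."""
    rng = random.Random(seed)  # seeded, so every instance is reproducible

    # Sampled parameters (ranges are illustrative assumptions).
    principal = rng.randrange(1000, 10001, 500)  # currency units
    rate_pct = rng.randint(2, 8)                 # annual rate in percent
    years = rng.randint(1, 5)

    question = (
        f"An investor deposits {principal} at {rate_pct}% annual interest, "
        f"compounded yearly. How much interest is earned after {years} years?"
    )

    # Executable trace: each intermediate quantity is computed and stored.
    growth = (1 + rate_pct / 100) ** years
    final_value = principal * growth
    interest = final_value - principal

    steps = [
        ("growth_factor", round(growth, 6)),
        ("final_value", round(final_value, 2)),
        ("interest", round(interest, 2)),
    ]
    return {"question": question, "steps": steps, "answer": round(interest, 2)}


if __name__ == "__main__":
    instance = compound_interest_template(seed=7)
    print(instance["question"])
    for name, value in instance["steps"]:
        print(f"  {name} = {value}")
```

Because the instance is a pure function of the seed, new seeds yield fresh questions with known intermediate values, which is one plausible way to obtain the scalable, contamination-free generation the abstract describes.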
