Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Abstract

Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.
