Do Large Language Models Reason Causally Like Us? Even Better?

Abstract

Causal reasoning is a core component of intelligence. Large language models (LLMs) have shown impressive capabilities in generating human-like text, raising questions about whether their responses reflect true understanding or statistical patterns. We compared causal reasoning in humans and four LLMs using tasks based on collider graphs, rating the likelihood of a query variable occurring given evidence from other variables. LLMs' causal inferences ranged from often nonsensical (GPT-3.5) to human-like to often more normatively aligned than those of humans (GPT-4o, Gemini-Pro, and Claude). Computational model fitting showed that one reason for GPT-4o, Gemini-Pro, and Claude's superior performance is they didn't exhibit the "associative bias" that plagues human causal reasoning. Nevertheless, even these LLMs did not fully capture subtler reasoning patterns associated with collider graphs, such as "explaining away".
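
To make the normative benchmark concrete, the following is a minimal sketch of "explaining away" in a collider graph C1 -> E <- C2, computed by brute-force enumeration. It is an illustration only, not the paper's computational model: the noisy-OR likelihood, the parameter values (PRIOR_C1, PRIOR_C2, STRENGTH, LEAK), and the function names are all assumptions chosen for the example.

```python
from itertools import product

# Illustrative assumptions, not values taken from the paper.
PRIOR_C1 = 0.5   # assumed base rate of cause C1
PRIOR_C2 = 0.5   # assumed base rate of cause C2
STRENGTH = 0.8   # assumed causal strength of each cause
LEAK = 0.1       # assumed probability of E from unmodeled background causes


def p_effect(c1, c2):
    """Noisy-OR probability that E=1 given the states of both causes."""
    return 1 - (1 - LEAK) * (1 - STRENGTH) ** c1 * (1 - STRENGTH) ** c2


def joint(c1, c2, e):
    """Joint probability of one full assignment (C1, C2, E)."""
    p_causes = (PRIOR_C1 if c1 else 1 - PRIOR_C1) * (PRIOR_C2 if c2 else 1 - PRIOR_C2)
    p_e = p_effect(c1, c2)
    return p_causes * (p_e if e else 1 - p_e)


def posterior(query, evidence):
    """P(query = 1 | evidence), where evidence maps variable names to 0 or 1."""
    numerator = denominator = 0.0
    for c1, c2, e in product((0, 1), repeat=3):
        world = {"C1": c1, "C2": c2, "E": e}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # skip worlds inconsistent with the evidence
        p = joint(c1, c2, e)
        denominator += p
        if world[query] == 1:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print("P(C1=1 | E=1)       =", round(posterior("C1", {"E": 1}), 3))
    print("P(C1=1 | E=1, C2=1) =", round(posterior("C1", {"E": 1, "C2": 1}), 3))
    print("P(C1=1 | E=1, C2=0) =", round(posterior("C1", {"E": 1, "C2": 0}), 3))
    print("P(C1=1 | C2=1)      =", round(posterior("C1", {"C2": 1}), 3))
```

Under these assumed parameters, learning that C2 is present lowers the probability of C1 once the effect is known, which is the explaining-away pattern. With no evidence about E, the two causes remain independent, so a rating of C1 that rises merely because C2 is present would reflect the kind of associative bias described above.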
