Abstract
The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a "behavioral" study of LLMs to benchmark their capability in generating causal arguments. Across a wide range of tasks, we find that LLMs can generate text corresponding to correct causal arguments with high probability, surpassing the best-performing existing methods. Algorithms based on GPT-3.5 and GPT-4 outperform existing algorithms on a pairwise causal discovery task (97%, a 13-point gain), a counterfactual reasoning task (92%, a 20-point gain), and event causality (86% accuracy in determining necessary and sufficient causes in vignettes). We perform robustness checks across tasks and show that the capabilities cannot be explained by dataset memorization alone, especially since LLMs generalize to novel datasets that were created after the training cutoff date. That said, LLMs exhibit unpredictable failure modes, and we discuss which kinds of errors may be improved and what the fundamental limits of LLM-based answers are. Overall, by operating on the text metadata, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. As a result, LLMs may be used by human domain experts to save effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. Given that LLMs ignore the actual data, our results also point to a fruitful research direction of developing algorithms that combine LLMs with existing causal techniques. Code and datasets are available at https://github.com/py-why/pywhy-llm.