Abstract
Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates or simulations for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Nearly all advanced approaches fail to replicate human behavior distributions across many models. Causes of failure are diverse and unpredictable, relating to input language, roles, and safeguarding. These results advise caution when using LLMs to study human behavior or as surrogates or simulations.