Abstract
The ability to adapt beliefs or behaviors in response to unexpected outcomes, known as reflection, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection, including perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performance of 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.