Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change)

Abstract

The recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, the state-of-the-art performance on natural language tasks is being pushed forward with every new large language model. Along with natural language abilities, there has been significant interest in understanding whether such models, trained on enormous amounts of data, exhibit reasoning capabilities. Hence there has been interest in developing benchmarks for various reasoning tasks, and the preliminary results from testing LLMs over such benchmarks seem mostly positive. However, the current benchmarks are relatively simplistic, and performance on them cannot be used as evidence to support the often outlandish claims being made about LLMs' reasoning capabilities. As of right now, these benchmarks represent only a very limited set of simple reasoning tasks, and we need to look at more sophisticated reasoning problems if we are to measure the true limits of such LLM-based systems. With this motivation, we propose an extensible assessment framework to test the abilities of LLMs on a central aspect of human intelligence: reasoning about actions and change. We provide multiple test cases that are more involved than any of the previously established reasoning benchmarks, and each test case evaluates a certain aspect of reasoning about actions and change. Initial evaluation results on the base version of GPT-3 (Davinci) showcase subpar performance on these benchmarks.
