Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models

Abstract

While large language models demonstrate impressive performance on static benchmarks, their true potential as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find, first, that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an excessively long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to large performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer from fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while reasoning methods like chain of thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.
