LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

Abstract

As LLMs continue to evolve and improve, they have acquired impressive logical reasoning, or vertical thinking, capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses a model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs on two aspects: the quality of the questions posed by the model and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle to employ lateral thinking during interactions. For example, even the most advanced model, GPT-4, exhibits an advantage to some extent, yet still maintains a noticeable gap when compared to humans. This evaluation benchmark provides LLMs with a highly challenging and distinctive task that is crucial to an effective AI assistant.
