Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

  • 2025-08-26 13:54:17
  • Congchi Yin, Tianyi Wu, Yankai Shu, Alex Gu, Yunhan Wang, Jun Shao, Xun Jiang, Piji Li

Abstract

Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans' discovery of the real world. We introduce a novel evaluation paradigm, black-box interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it within a given number of exploration turns and reasoning over the observed input-output pairs. Leveraging this idea, we build the Oracle benchmark, which comprises 6 types of black-box tasks and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40%. Further analysis indicates a universal difficulty among LLMs: they lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.
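
To make the interaction protocol concrete, the following is a minimal sketch of a black-box with a limited exploration budget. It is an illustration only, not the Oracle benchmark's implementation: the names BlackBox, explore_and_fit_linear, and accuracy are hypothetical, the hidden function is a toy linear map, and a hand-written explorer stands in for the LLM.

```python
from typing import Callable, Iterable, List, Tuple


class BlackBox:
    """Hypothetical wrapper: hides a function and caps the number of queries."""

    def __init__(self, hidden_fn: Callable[[int], int], max_turns: int) -> None:
        self._hidden_fn = hidden_fn
        self.max_turns = max_turns
        self.history: List[Tuple[int, int]] = []  # observed (input, output) pairs

    def query(self, x: int) -> int:
        """Spend one exploration turn to observe the output for input x."""
        if len(self.history) >= self.max_turns:
            raise RuntimeError("exploration budget exhausted")
        y = self._hidden_fn(x)
        self.history.append((x, y))
        return y


def explore_and_fit_linear(box: BlackBox) -> Callable[[int], int]:
    """Toy explorer standing in for the LLM.

    Assumes (for illustration) that the hidden function has the form a*x + b,
    so two well-chosen queries are enough to recover it.
    """
    b = box.query(0)        # f(0) = b
    a = box.query(1) - b    # f(1) - f(0) = a
    return lambda x: a * x + b


def accuracy(hypothesis: Callable[[int], int],
             hidden_fn: Callable[[int], int],
             test_inputs: Iterable[int]) -> float:
    """Fraction of held-out inputs on which the hypothesis matches the hidden function."""
    inputs = list(test_inputs)
    correct = sum(hypothesis(x) == hidden_fn(x) for x in inputs)
    return correct / len(inputs)


def hidden(x: int) -> int:
    # Toy hidden function, unknown to the explorer.
    return 3 * x + 2


box = BlackBox(hidden, max_turns=5)
hypothesis = explore_and_fit_linear(box)
score = accuracy(hypothesis, hidden, range(10, 20))
```

In this sketch the explorer knows the function family in advance, so two turns suffice. The difficulty the abstract highlights arises when the family is unknown and each query has to be chosen adaptively to refine the current hypothesis.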

 
