Abstract
Large Language Models (LLMs) have been shown to achieve breakthroughperformance on complex logical reasoning tasks. Nevertheless, most existingresearch focuses on employing formal language to guide LLMs to derive reliablereasoning paths, while systematic evaluations of these capabilities are stilllimited. In this paper, we aim to conduct a comprehensive evaluation of LLMsacross various logical reasoning problems utilizing formal languages. From theperspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, andformat of trajectories, our key findings are: 1) Thinking models significantlyoutperform Instruct models, especially when formal language is employed; 2) AllLLMs exhibit limitations in inductive reasoning capability, irrespective ofwhether they use a formal language; 3) Data with PoT format achieves the bestgeneralization performance across other languages. Additionally, we also curatethe formal-relative training data to further enhance the small language models,and the experimental results indicate that a simple rejected fine-tuning methodcan better enable LLMs to generalize across formal languages and achieve thebest overall performance. Our codes and reports are available athttps://github.com/jiangjin1999/FormalEval.