When Do Program-of-Thoughts Work for Reasoning?

Abstract

The reasoning capabilities of Large Language Models (LLMs) play a pivotal role in the realm of embodied artificial intelligence. Although there are effective methods such as program-of-thought prompting for LLMs, which uses programming languages to tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose the complexity-impacted reasoning score (CIRS), which combines structural and logical attributes, to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity by considering the difficulty and the cyclomatic complexity. Through an empirical analysis, we find that not all levels of code complexity can be learned or understood by LLMs; an optimal level of complexity is critical to improving reasoning abilities through program-aided prompting. We then design an auto-synthesizing and stratifying algorithm and apply it to instruction generation for mathematical reasoning and to code data filtering for code generation tasks. Extensive results demonstrate the effectiveness of our proposed approach. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
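
To make the two ingredients named above more concrete, the following is a minimal sketch of how structural features might be read off an abstract syntax tree and how cyclomatic complexity might be counted for a Python program. It is not the CIRS definition: the abstract does not give the scoring formula, so the function names (`structural_features`, `cyclomatic_complexity`), the choice of features (node count, node types, depth), and the set of node types treated as decision points are all assumptions made for illustration. The "difficulty" term is not covered here.

```python
import ast

# Node types counted as decision points. This set is an assumption of the
# sketch, not a definition taken from the paper.
_DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.IfExp,
    ast.ExceptHandler, ast.comprehension,
)


def ast_depth(node):
    """Return the depth of the AST rooted at `node` (a leaf has depth 1)."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 1
    return 1 + max(ast_depth(child) for child in children)


def structural_features(source):
    """Collect simple structural statistics from the AST of `source`."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        "node_count": len(nodes),
        "node_types": len({type(n).__name__ for n in nodes}),
        "depth": ast_depth(tree),
    }


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity as 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of `and` / `or` adds one more branch.
            complexity += len(node.values) - 1
    return complexity


if __name__ == "__main__":
    simple = "x = 1\n"
    branching = (
        "def f(n):\n"
        "    total = 0\n"
        "    for i in range(n):\n"
        "        if i % 2 == 0 and i > 2:\n"
        "            total += i\n"
        "    return total\n"
    )
    for name, src in [("simple", simple), ("branching", branching)]:
        print(name, structural_features(src), cyclomatic_complexity(src))
```

Under these assumptions, a straight-line program has cyclomatic complexity 1, while the `branching` example has 4 (one loop, one branch, and one extra boolean operand). A stratifying step of the kind the abstract describes could then bucket programs by such scores, though how the paper combines them is not stated here.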
