Abstract
Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART seamlessly pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible, and makes it easy for humans to improve performance by correcting errors in task-specific programs or incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.
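To make the pause-and-resume behaviour concrete, the following is a minimal sketch, not the ART implementation. It assumes a line-based program format (`Q1: [tool] input`, `#1: output`, `Ans: ...`, `[EOQ]`) chosen purely for illustration. `scripted_model`, `TOOLS`, and `run_program` are hypothetical names, and the "model" is a hard-coded stand-in for a frozen LLM.

```python
import re
from typing import Callable, Dict

# Hypothetical tool registry: tool name -> function from input text to output text.
TOOLS: Dict[str, Callable[[str], str]] = {
    # Toy tool that only handles sums such as "17+25".
    "arithmetic": lambda expr: str(sum(int(term) for term in expr.split("+"))),
}

# Assumed format of a tool-calling step, e.g. "Q1: [arithmetic] 17+25".
TOOL_CALL = re.compile(r"Q(\d+): \[(\w+)\] (.*)")


def scripted_model(text_so_far: str) -> str:
    """Stand-in for a frozen LLM: returns the next program line."""
    if "Q1:" not in text_so_far:
        return "Q1: [arithmetic] 17+25"
    if "#1: " in text_so_far and "Ans:" not in text_so_far:
        # Read the tool output that the driver spliced into the text.
        tool_output = text_so_far.rsplit("#1: ", 1)[1].strip()
        return f"Ans: {tool_output}"
    return "[EOQ]"  # assumed end-of-program marker


def run_program(task: str,
                model: Callable[[str], str],
                tools: Dict[str, Callable[[str], str]],
                max_steps: int = 10) -> str:
    """Generate line by line, pausing to run a tool whenever one is called."""
    transcript = task + "\n"
    for _ in range(max_steps):
        line = model(transcript)
        if line == "[EOQ]":
            break
        transcript += line + "\n"
        call = TOOL_CALL.match(line)
        if call and call.group(2) in tools:
            step, name, tool_input = call.groups()
            # Pause generation, run the tool, and integrate its output
            # before the model is asked for the next line.
            transcript += f"#{step}: {tools[name](tool_input)}\n"
    return transcript


if __name__ == "__main__":
    print(run_program("Task: What is 17+25?", scripted_model, TOOLS))
```

In this sketch, adding a new tool amounts to adding one entry to `TOOLS`, which loosely mirrors the extensibility the abstract describes. Selecting demonstrations from a task library is not modelled here.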