Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks when provided with a few examples at test time (few-shot prompting). Much of this success can be attributed to prompting methods for reasoning, such as chain-of-thought (CoT), that employ LLMs both for understanding the problem description by decomposing it into steps and for solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, they often make logical and arithmetic mistakes in the solution part, even when the problem is correctly decomposed. We present Program-Aided Language models (PaL): a new method that uses the LLM to understand natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a programmatic runtime such as a Python interpreter. With PaL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We experiment with 12 reasoning tasks from BIG-Bench Hard and other benchmarks, including mathematical reasoning, symbolic reasoning, and algorithmic problems. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models, and we set new state-of-the-art results in all 12 benchmarks. For example, PaL using Codex achieves state-of-the-art few-shot accuracy on the GSM benchmark of math word problems when the model is allowed only a single decoding, surpassing PaLM-540B with chain-of-thought prompting by an absolute 8%. In three reasoning tasks from the BIG-Bench Hard benchmark, PaL outperforms CoT by 11%. On GSM-hard, a more challenging version of GSM that we create, PaL outperforms chain-of-thought by an absolute 40%.
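To make the division of labor concrete, the following is a minimal sketch of the idea, not the authors' implementation. It assumes the LLM has already returned a program as text; the string `GENERATED_PROGRAM`, the helper `run_generated_program`, the `solution` entry-point convention, and the word problem itself are all hypothetical and chosen only for illustration. No model is called here: the point is that the arithmetic is carried out by the Python interpreter rather than by the language model.

```python
# Hypothetical stand-in for LLM output. In a PaL-style setup the model would
# write this program text from the natural language problem; here it is
# hard-coded so the example is self-contained.
# Illustrative problem (an assumption, not from the source):
#   "Olivia has $23. She buys five bagels for $3 each. How much money is left?"
GENERATED_PROGRAM = '''
def solution():
    # Each variable mirrors one step of the problem decomposition.
    money_initial = 23
    bagels = 5
    bagel_cost = 3
    money_spent = bagels * bagel_cost
    money_left = money_initial - money_spent
    return money_left
'''


def run_generated_program(program_text: str, entry_point: str = "solution"):
    """Execute generated program text and return the entry point's result.

    The interpreter, not the language model, performs the computation.
    """
    namespace = {}
    # exec() runs arbitrary code: acceptable for this toy string, but real
    # model output should be run in a sandboxed, resource-limited process.
    exec(program_text, namespace)
    return namespace[entry_point]()


answer = run_generated_program(GENERATED_PROGRAM)
print(answer)  # 8
```

Because the final answer comes from executing the steps, a correct decomposition cannot be spoiled by an arithmetic slip in the model's own text, which is the failure mode of chain-of-thought that the abstract describes.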