Abstract
How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into $58$ distinct elements, focusing on the logic of supply and demand, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to overfit to evaluation benchmarks; we thus hope that it will serve as a useful tool for both evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on $27$ LLMs, ranging from small open-source models to the current state of the art. We examine each model's ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.