FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

  • 2025-07-17 17:53:55
  • Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, Amnon Shashua
  • 0

Abstract

Frontier AI models demonstrate formidable breadth of knowledge. But how closeare they to true human -- or superhuman -- expertise? Genuine experts cantackle the hardest problems and push the boundaries of scientificunderstanding. To illuminate the limits of frontier model capabilities, we turnaway from contrived competitive programming puzzles, and instead focus onreal-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graphtheory, logic, and algorithms, all well within the training distribution offrontier models. Our problems are incredibly demanding, requiring an array ofreasoning steps. The dataset has three key properties. First, it is ofcommercial interest and relates to practical large-scale optimisation problems,such as those arising in routing, scheduling, and network design. Second, it isgenerated from the highly expressive framework of Monadic Second-Order (MSO)logic on graphs, paving the way toward automatic problem generation at scale;ideal for building RL environments. Third, many of our problems are intimatelyrelated to the frontier of theoretical computer science, and to centralconjectures therein, such as the Strong Exponential Time Hypothesis (SETH). Assuch, any significant algorithmic progress on our dataset, beyond knownresults, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely onFormulaOne, solving less than 1% of the questions, even when given 10 attemptsand explanatory fewshot examples -- highlighting how far they remain fromexpert-level understanding in some domains. To support further research, weadditionally curate FormulaOne-Warmup, offering a set of simpler tasks, fromthe same distribution. We release the full corpus along with a comprehensiveevaluation framework.

 

Quick Read (beta)

loading the full paper ...