Abstract
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.