Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline

Abstract

The International Mathematical Olympiad (IMO) is widely regarded as the world championship of high-school mathematics. IMO problems are renowned for their difficulty and novelty, demanding deep insight, creativity, and rigor. Although large language models perform well on many mathematical benchmarks, they often struggle with Olympiad-level problems. Using carefully designed prompts, we construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025, avoiding data contamination for models released before the competition. Equipped with any of the three leading models -- Gemini 2.5 Pro, Grok-4, or GPT-5 -- our pipeline correctly solved 5 out of the 6 problems ($\approx$85.7% accuracy). This is in sharp contrast to their baseline accuracies: 31.6% (Gemini 2.5 Pro), 21.4% (Grok-4), and 38.1% (GPT-5), obtained by selecting the best of 32 candidate solutions. The substantial improvement underscores that the path to advanced AI reasoning requires not only developing more powerful base models but also designing effective methodologies to harness their full potential for complex tasks.
