The world of empirical machine learning (ML) relies strongly on benchmarks in order to determine the relative effectiveness of different algorithms and methods. This paper proposes the notion of "a benchmark lottery" that describes the overall fragility of the ML benchmarking process. The benchmark lottery postulates that many factors, other than fundamental algorithmic superiority, may lead to a method being perceived as superior. On multiple benchmark setups that are prevalent in the ML community, we show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks, highlighting the fragility of the current paradigms and the potentially fallacious interpretations derived from benchmarking ML methods. Given that every benchmark makes a statement about what it perceives to be important, we argue that this might lead to biased progress in the community. We discuss the implications of the observed phenomena and provide recommendations on mitigating them, using multiple machine learning domains and communities as use cases, including natural language processing, computer vision, information retrieval, recommender systems, and reinforcement learning.