How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference

Abstract

As large language models (LLMs) spread across industries, understanding their environmental footprint at the inference level is no longer optional; it is essential. However, most existing studies exclude proprietary models, overlook infrastructural variability and overhead, or focus solely on training, even as inference increasingly dominates AI's environmental impact. To bridge this gap, this paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models as deployed in commercial data centers. Our framework combines public API performance data with region-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost. Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. While a single short GPT-4o query consumes 0.43 Wh, scaling this to 700 million queries/day results in substantial annual environmental impacts. These include electricity use comparable to 35,000 U.S. homes, freshwater evaporation matching the annual drinking needs of 1.2 million people, and carbon emissions requiring a Chicago-sized forest to offset. These findings illustrate a growing paradox: although individual queries are efficient, their global scale drives disproportionate resource consumption. Our study provides a standardized, empirically grounded methodology for benchmarking the sustainability of LLM deployments, laying a foundation for future environmental accountability in AI development and sustainability standards.
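
To make the scaling argument in the abstract concrete, the following is a minimal back-of-envelope sketch, not the paper's methodology. It takes the two figures quoted above (0.43 Wh per short GPT-4o query and 700 million queries/day) and multiplies them out. The assumptions that every query is a short one, that volume is constant over a 365-day year, and the helper name `annual_energy_gwh` are ours for illustration; the paper's own annual estimates may rest on a different query mix and additional factors.

```python
# Back-of-envelope scaling of per-query energy to fleet-level energy.
# Figures quoted in the abstract:
WH_PER_SHORT_QUERY = 0.43        # Wh for one short GPT-4o query
QUERIES_PER_DAY = 700_000_000    # 700 million queries/day

# Assumption (not from the paper): constant volume over a 365-day year,
# with every query treated as a short query.
DAYS_PER_YEAR = 365


def annual_energy_gwh(wh_per_query: float,
                      queries_per_day: int,
                      days: int = DAYS_PER_YEAR) -> float:
    """Return total annual energy in GWh under the naive scaling assumption."""
    total_wh = wh_per_query * queries_per_day * days
    return total_wh / 1e9  # 1 GWh = 1e9 Wh


daily_mwh = WH_PER_SHORT_QUERY * QUERIES_PER_DAY / 1e6  # 1 MWh = 1e6 Wh
annual_gwh = annual_energy_gwh(WH_PER_SHORT_QUERY, QUERIES_PER_DAY)

print(f"Daily energy:  {daily_mwh:.1f} MWh")
print(f"Annual energy: {annual_gwh:.1f} GWh")
```

Under these assumptions the total comes to about 301 MWh per day, or roughly 110 GWh per year, from a per-query cost of less than half a watt-hour. This is the paradox the abstract describes: a negligible individual cost becomes a large aggregate once multiplied by global query volume.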
