Abstract
As large language models (LLMs) spread across industries, understanding their environmental footprint at the inference level is essential. However, most existing studies exclude proprietary models, overlook infrastructural variability and overhead, or focus solely on training, even as inference increasingly dominates AI's environmental impact. To bridge this gap, this paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models as deployed in commercial data centers. Our framework combines public API performance data with region-specific environmental multipliers and statistical inference of hardware configurations. We additionally use cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost. Our results show that o3 and DeepSeek-R1 are the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. While a single short GPT-4o query consumes 0.43 Wh, scaling this to 700 million queries/day results in substantial annual environmental impacts. These include electricity use comparable to that of 35,000 U.S. homes, freshwater evaporation matching the annual drinking needs of 1.2 million people, and carbon emissions requiring a Chicago-sized forest to offset. These findings illustrate a growing paradox: although individual queries are efficient, their global scale drives disproportionate resource consumption. Our study provides a standardized, empirically grounded methodology for benchmarking the sustainability of LLM deployments, laying a foundation for future environmental accountability in AI development and sustainability standards.
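
To make the scaling step concrete, the following is a minimal back-of-the-envelope sketch that multiplies the per-query figure by the daily query volume quoted above. The function name `annual_energy_gwh` and the assumption of a constant daily volume over a 365-day year are illustrative choices, not part of the paper's framework. The abstract does not state the query mix or usage assumptions behind its annual figures, so this naive product is not expected to reproduce them.

```python
# Illustrative scaling arithmetic only; not the paper's full methodology.
WH_PER_SHORT_QUERY = 0.43      # from the abstract: one short GPT-4o query
QUERIES_PER_DAY = 700_000_000  # from the abstract
DAYS_PER_YEAR = 365            # assumption: constant daily volume, non-leap year


def annual_energy_gwh(wh_per_query: float,
                      queries_per_day: int,
                      days: int = DAYS_PER_YEAR) -> float:
    """Return annual energy in GWh for a fixed per-query cost and daily volume."""
    wh_per_year = wh_per_query * queries_per_day * days
    return wh_per_year / 1e9  # 1 GWh = 1e9 Wh


if __name__ == "__main__":
    daily_mwh = WH_PER_SHORT_QUERY * QUERIES_PER_DAY / 1e6  # 1 MWh = 1e6 Wh
    annual_gwh = annual_energy_gwh(WH_PER_SHORT_QUERY, QUERIES_PER_DAY)
    print(f"daily:  {daily_mwh:.1f} MWh")
    print(f"annual: {annual_gwh:.1f} GWh")
```

Under these assumptions, short queries alone would account for roughly 301 MWh per day, or about 110 GWh per year. Longer prompts cost more energy per query, so a realistic query mix would raise this figure.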