Abstract
Privacy restrictions hinder the sharing of real-world Water Distribution Network (WDN) models, limiting the application of emerging data-driven machine learning, which typically requires extensive observations. To address this challenge, we propose the dataset DiTEC-WDN that comprises 36,000 unique scenarios simulated over either short-term (24 hours) or long-term (1 year) periods. We constructed this dataset using an automated pipeline that optimizes crucial parameters (e.g., pressure, flow rate, and demand patterns), facilitates large-scale simulations, and records discrete, synthetic but hydraulically realistic states under standard conditions via rule validation and post-hoc analysis. With a total of 228 million generated graph-based states, DiTEC-WDN can support a variety of machine-learning tasks, including graph-level, node-level, and link-level regression, as well as time-series forecasting. This contribution, released under a public license, encourages open scientific research in the critical water sector, eliminates the risk of exposing sensitive data, and fulfills the need for a large-scale water distribution network benchmark for study comparisons and scenario analysis.