Improving the Efficiency of a Deep Reinforcement Learning-Based Power Management System for HPC Clusters Using Curriculum Learning

Abstract

High energy consumption remains a key challenge in high-performance computing (HPC) systems, which often feature hundreds or thousands of nodes drawing substantial power even in idle or standby modes. Although powering down unused nodes can improve energy efficiency, choosing the wrong time to do so can degrade quality of service by delaying job execution. Machine learning, in particular reinforcement learning (RL), has shown promise in determining optimal times to switch nodes on or off. In this study, we enhance the performance of a deep reinforcement learning (DRL) agent for HPC power management by integrating curriculum learning (CL), a training approach that introduces tasks with gradually increasing difficulty. Using the Batsim-py simulation framework, we compare the proposed CL-based agent to both a baseline DRL method (without CL) and the conventional fixed-time timeout strategy. Experimental results confirm that an easy-to-hard curriculum outperforms other training orders in terms of reducing wasted energy usage. The best agent achieves a 3.73% energy reduction over the baseline DRL method and a 4.66% improvement compared to the best timeout configuration (shutdown every 15 minutes of idle time). In addition, it reduces average job waiting time by 9.24% and maintains a higher job-filling rate, indicating more effective resource utilization. Sensitivity tests across various switch-on durations, power levels, and cluster sizes further reveal the agent's adaptability to changing system parameters without retraining. These findings demonstrate that curriculum learning can significantly improve DRL-based power management in HPC, balancing energy savings, quality of service, and robustness to diverse configurations.
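To make the fixed-time timeout baseline concrete, the following is a minimal Python sketch. It assumes the timeout rule means a node is switched off once it has been continuously idle for 15 minutes. The `Node` class, its fields, and `nodes_to_switch_off` are hypothetical names chosen for illustration. They are not part of Batsim-py and not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

# 15 minutes in seconds: the best timeout configuration reported in the abstract.
IDLE_TIMEOUT_S = 15 * 60


@dataclass
class Node:
    """Hypothetical node record; not a Batsim-py type."""
    node_id: int
    idle_since: Optional[float]  # time the node became idle; None if it is busy
    powered_on: bool = True


def nodes_to_switch_off(nodes: List[Node], now: float,
                        timeout_s: float = IDLE_TIMEOUT_S) -> List[int]:
    """Return IDs of powered-on nodes idle for at least timeout_s seconds."""
    return [
        n.node_id
        for n in nodes
        if n.powered_on
        and n.idle_since is not None
        and now - n.idle_since >= timeout_s
    ]


# Illustrative cluster state (made-up values, times in seconds).
nodes = [
    Node(0, idle_since=0.0),                     # idle since t = 0
    Node(1, idle_since=600.0),                   # idle since t = 600
    Node(2, idle_since=None),                    # busy
    Node(3, idle_since=0.0, powered_on=False),   # already off
]

print(nodes_to_switch_off(nodes, now=900.0))
```

A rule like this depends only on elapsed idle time and ignores the job queue, which is the limitation a learned policy is meant to address.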
