Abstract
Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning that significantly affects training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) remains challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techniques (e.g., population-based training or Bayesian optimization) to the RL scenario. However, these approaches remain sample-inefficient and computationally expensive, which limits their use in a wide range of applications. In this paper, we propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within a single run. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. ULTHO also provides a quantitative and statistical perspective for filtering hyperparameters (HPs) efficiently. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet. Extensive experiments demonstrate that ULTHO can achieve superior performance with a simple architecture, contributing to the development of advanced and automated RL systems.