Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

Abstract

In Deep Reinforcement Learning models trained using gradient-based techniques, the choice of optimizer and its learning rate are crucial to achieving good performance: higher learning rates can prevent the model from learning effectively, while lower ones might slow convergence. Additionally, due to the non-stationarity of the objective function, the best-performing learning rate can change over the training steps. To adapt the learning rate, a standard technique consists of using decay schedulers. However, these schedulers assume that the model is progressively approaching convergence, which may not always be true, leading to delayed or premature adjustments. In this work, we propose dynamic Learning Rate for deep Reinforcement Learning (LRRL), a meta-learning approach that selects the learning rate based on the agent's performance during training. LRRL is based on a multi-armed bandit algorithm, where each arm represents a different learning rate, and the bandit feedback is provided by the cumulative returns of the RL policy to update the arms' probability distribution. Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.
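The selection mechanism described above can be illustrated with a minimal sketch. It assumes an Exp3-style exponential-weights update, which is a common choice for adversarial bandits; the abstract does not name the specific bandit algorithm, so this is not necessarily the authors' method. The class name `LearningRateBandit`, the parameter `step_size`, and the requirement that the reward be rescaled to [0, 1] are all hypothetical choices made for illustration.

```python
import math
import random


class LearningRateBandit:
    """Exp3-style bandit over a fixed set of candidate learning rates.

    Illustrative sketch only: the update rule and all names here are
    assumptions, not taken from the paper.
    """

    def __init__(self, learning_rates, step_size=0.1, seed=0):
        self.learning_rates = list(learning_rates)  # one arm per learning rate
        self.step_size = step_size                  # bandit step size (hypothetical)
        self.weights = [0.0] * len(self.learning_rates)  # log-weights, one per arm
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over the log-weights, shifted by the max for numerical stability.
        top = max(self.weights)
        exps = [math.exp(w - top) for w in self.weights]
        total = sum(exps)
        return [e / total for e in exps]

    def select(self):
        # Sample an arm from the current distribution; return (index, learning rate).
        probs = self.probabilities()
        arm = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return arm, self.learning_rates[arm]

    def update(self, arm, reward):
        # Importance-weighted estimate: only the pulled arm is credited, and the
        # reward is divided by its selection probability to keep the estimate
        # unbiased. `reward` is assumed to be rescaled to [0, 1] by the caller.
        probs = self.probabilities()
        self.weights[arm] += self.step_size * reward / probs[arm]
```

In a training loop, one would call `select()` at a fixed interval, set the optimizer's learning rate to the returned value, train for that interval, and then call `update()` with the agent's cumulative return over the interval after rescaling it. Keeping log-weights rather than raw weights avoids overflow when the importance-weighted rewards accumulate over long runs.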
