SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning

Abstract

High update-to-data (UTD) ratio algorithms in reinforcement learning (RL) improve sample efficiency but incur high computational costs, limiting real-world scalability. We propose Offline Stabilization Phases for Efficient Q-Learning (SPEQ), an RL algorithm that combines low-UTD online training with periodic offline stabilization phases. During these phases, Q-functions are fine-tuned with high UTD ratios on a fixed replay buffer, reducing redundant updates on suboptimal data. This structured training schedule optimally balances computational and sample efficiency, addressing the limitations of both high and low UTD ratio approaches. We empirically demonstrate that SPEQ requires from 40% to 99% fewer gradient updates and 27% to 78% less training time compared to state-of-the-art high UTD ratio methods while maintaining or surpassing their performance on the MuJoCo continuous control benchmark. Our findings highlight the potential of periodic stabilization phases as an effective alternative to conventional training schedules, paving the way for more scalable reinforcement learning solutions in real-world applications where computational resources are constrained.
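The schedule described in the abstract can be illustrated with a minimal sketch that only counts gradient updates; it does not train anything. The function names and all numeric values (environment steps, online UTD of 1, phase interval, updates per offline phase, baseline UTD of 20) are hypothetical choices made for illustration and are not taken from the paper.

```python
def speq_style_schedule(total_env_steps, online_utd, phase_interval, offline_updates):
    """Yield (env_step, kind, n_updates) events for a SPEQ-style schedule.

    Assumption: one low-UTD online update batch per environment step, plus an
    offline stabilization phase on the frozen replay buffer every
    `phase_interval` steps. The paper's actual schedule may differ.
    """
    for step in range(1, total_env_steps + 1):
        # Online training: a small number of updates per collected transition.
        yield step, "online", online_utd
        if step % phase_interval == 0:
            # Offline phase: no new data is collected, many updates are applied.
            yield step, "offline", offline_updates


def count_updates(events):
    """Total number of gradient updates over a sequence of schedule events."""
    return sum(n for _, _, n in events)


def constant_utd_updates(total_env_steps, utd):
    """Gradient updates used by a baseline with a fixed UTD ratio."""
    return total_env_steps * utd


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
speq_total = count_updates(speq_style_schedule(10_000, 1, 1_000, 5_000))
baseline_total = constant_utd_updates(10_000, 20)
reduction = 1 - speq_total / baseline_total

print(f"SPEQ-style updates: {speq_total}")
print(f"Constant-UTD updates: {baseline_total}")
print(f"Reduction: {reduction:.0%}")
```

Under these made-up settings the periodic schedule performs 10,000 online updates plus ten offline phases of 5,000 updates each, against 200,000 updates for the fixed-UTD baseline. The point of the sketch is the shape of the trade-off, not the specific percentage.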
