SPEQ: Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning

Abstract

A key challenge in Deep Reinforcement Learning is sample efficiency, especially in real-world applications where collecting environment interactions is expensive or risky. Recent off-policy algorithms improve sample efficiency by increasing the Update-To-Data (UTD) ratio and performing more gradient updates per environment interaction. While this improves sample efficiency, it significantly increases computational cost due to the higher number of gradient updates required. In this paper we propose a sample-efficient method to improve computational efficiency by separating training into distinct learning phases in order to exploit gradient updates more effectively. Our approach builds on top of the Dropout Q-Functions (DroQ) algorithm and alternates between an online, low UTD ratio training phase, and an offline stabilization phase. During the stabilization phase, we fine-tune the Q-functions without collecting new environment interactions. This process improves the effectiveness of the replay buffer and reduces computational overhead. Our experimental results on continuous control problems show that our method achieves results comparable to state-of-the-art, high UTD ratio algorithms while requiring 56% fewer gradient updates and 50% less training time than DroQ. Our approach offers an effective and computationally economical solution while maintaining the same sample efficiency as the more costly, high UTD ratio state-of-the-art.
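
To make the alternation between phases concrete, the following minimal sketch counts gradient updates under two schedules: a constant high UTD ratio, and a low UTD ratio interleaved with periodic offline stabilization phases. The function name, its parameters, and every number used here are hypothetical and chosen only for illustration; they are not the paper's settings and do not reproduce its reported savings.

```python
def count_gradient_updates(total_env_steps, online_utd,
                           stabilization_every=None, stabilization_updates=0):
    """Count gradient updates for a schedule that alternates phases.

    All arguments are illustrative assumptions, not values from the paper.
    """
    updates = 0
    for step in range(1, total_env_steps + 1):
        # Online phase: one environment interaction, then `online_utd` updates.
        updates += online_utd
        # Offline stabilization phase: extra updates on the existing replay
        # buffer, with no new environment interactions collected.
        if stabilization_every and step % stabilization_every == 0:
            updates += stabilization_updates
    return updates


# Constant high UTD ratio (hypothetical UTD of 20).
high_utd = count_gradient_updates(10_000, online_utd=20)

# Low UTD ratio plus a stabilization phase every 1,000 steps (hypothetical).
phased = count_gradient_updates(
    10_000, online_utd=1,
    stabilization_every=1_000, stabilization_updates=5_000,
)

print(high_utd, phased)
```

Both schedules consume the same number of environment interactions; they differ only in how many gradient updates are spent and when, which is the trade-off the phased approach is designed to exploit.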
