Bias Resilient Multi-Step Off-Policy Goal-Conditioned Reinforcement Learning

Abstract

In goal-conditioned reinforcement learning (GCRL), sparse rewards present significant challenges, often obstructing efficient learning. Although multi-step GCRL can boost this efficiency, it can also lead to off-policy biases in target values. This paper analyzes these biases in depth, dividing them into two categories: "shooting" and "shifting". Recognizing that certain behavior policies can hasten policy refinement, we present solutions designed to capitalize on the positive aspects of these biases while minimizing their drawbacks, enabling the use of larger step sizes to speed up GCRL. An empirical study demonstrates that our approach ensures a resilient and robust improvement, even in ten-step learning scenarios, leading to superior learning efficiency and performance that generally surpass the baseline and several state-of-the-art multi-step GCRL benchmarks.
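
For orientation, the multi-step target value mentioned above can be illustrated with the standard n-step bootstrapped return. This is a minimal sketch and not the method proposed in the paper: the function name `n_step_target`, the argument `q_bootstrap` (standing in for a target network's value estimate at the state reached after n steps), and the default discount `gamma` are all assumptions introduced here for illustration.

```python
def n_step_target(rewards, q_bootstrap, gamma=0.98):
    """Standard n-step target: sum_{i<n} gamma^i * r_i + gamma^n * Q.

    rewards     -- the n rewards stored along a replayed trajectory segment
    q_bootstrap -- hypothetical value estimate at the state reached after n steps
    gamma       -- discount factor (assumed value, not taken from the paper)
    """
    target = 0.0
    for i, r in enumerate(rewards):
        # Discount each stored reward by how far into the segment it occurs.
        target += (gamma ** i) * r
    # Bootstrap from the value estimate at the end of the n-step segment.
    return target + (gamma ** len(rewards)) * q_bootstrap
```

The rewards in such a target come from whatever behavior policy filled the replay buffer, so the sum reflects actions the current policy might not take. That mismatch is the source of the off-policy bias discussed in the abstract, and it grows with the number of steps n.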
