From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning

Abstract

Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding discrepancy measure. First, we train a predictor network to mimic a fixed target network's embeddings based on expert state transitions. Later, the prediction error between these networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We provide a formal theoretical construct that offers insights into how RND prediction errors effectively serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.
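
The two-stage mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: each transition is encoded as a concatenated (state, next state) vector, both networks are one-hidden-layer tanh MLPs, the data is synthetic, and the reward is taken as the negative prediction error so that expert-like transitions score higher. The abstract does not specify the sign convention or any scaling of the reward.

```python
import numpy as np


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Randomly initialise a one-hidden-layer tanh MLP (hypothetical architecture)."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def forward(params, x):
    """Return the embedding and the hidden activations for a batch x."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"], h


def prediction_error(predictor, target, x):
    """Squared embedding discrepancy between predictor and target, per row of x."""
    pred, _ = forward(predictor, x)
    tgt, _ = forward(target, x)
    return np.sum((pred - tgt) ** 2, axis=1)


def train_predictor(predictor, target, expert_x, steps=2000, lr=0.01):
    """Stage 1: fit the predictor to the frozen target on expert transitions only."""
    tgt, _ = forward(target, expert_x)  # the target network is never updated
    n = expert_x.shape[0]
    for _ in range(steps):
        pred, h = forward(predictor, expert_x)
        g_out = 2.0 * (pred - tgt) / n          # gradient of the mean squared error
        g_h = g_out @ predictor["W2"].T         # backpropagate through the output layer
        g_z = g_h * (1.0 - h ** 2)              # derivative of tanh
        grads = {
            "W2": h.T @ g_out,
            "b2": g_out.sum(axis=0),
            "W1": expert_x.T @ g_z,
            "b1": g_z.sum(axis=0),
        }
        for name, grad in grads.items():
            predictor[name] -= lr * grad        # plain full-batch gradient descent
    return predictor


rng = np.random.default_rng(0)
state_dim = 4
in_dim = 2 * state_dim  # assumption: input is the concatenation of (s, s')

target = init_mlp(rng, in_dim, 32, 4)     # fixed random target network
predictor = init_mlp(rng, in_dim, 32, 4)  # trainable predictor network

# Synthetic stand-ins: expert transitions form a tight cluster, while the
# static dataset mixes expert-like rows with rows spread over a wider range.
expert_x = 0.5 + 0.1 * rng.normal(size=(256, in_dim))
dataset_x = np.vstack([
    0.5 + 0.1 * rng.normal(size=(128, in_dim)),
    rng.uniform(-3.0, 3.0, size=(128, in_dim)),
])

error_before = prediction_error(predictor, target, expert_x).mean()
train_predictor(predictor, target, expert_x)
error_after = prediction_error(predictor, target, expert_x).mean()

# Stage 2: annotate every transition in the static dataset.
# Assumption: reward = -error, so a low error (expert-like) gives a high reward.
rewards = -prediction_error(predictor, target, dataset_x)
expert_like_reward = rewards[:128].mean()
other_reward = rewards[128:].mean()
print(f"expert error before/after training: {error_before:.4f} / {error_after:.4f}")
print(f"mean reward, expert-like vs other: {expert_like_reward:.4f} vs {other_reward:.4f}")
```

The annotated `rewards` array could then be attached to the dataset and passed to any standard offline RL algorithm. Because the predictor only ever sees expert transitions, its error stays low near the expert distribution and grows away from it, which is the property the abstract relies on to distinguish expert-like transitions.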
