Optimal Reward Labeling: Bridging Offline Preference and Reward-Based Reinforcement Learning

Abstract

Offline reinforcement learning has become one of the most practical RL settings. A recent success story is RLHF, offline preference-based RL (PBRL) with preferences from humans. However, most existing work on offline RL focuses on the standard setting with scalar reward feedback. It remains unknown how to universally transfer the existing rich understanding of offline RL from the reward-based to the preference-based setting. In this work, we propose a general framework to bridge this gap. Our key insight is to transform preference feedback into scalar rewards via optimal reward labeling (ORL), after which any reward-based offline RL algorithm can be applied to the dataset with the reward labels. We theoretically show the connection between several recent PBRL techniques and our framework combined with specific offline RL algorithms in terms of how they utilize the preference signals. By combining reward labeling with different algorithms, our framework can lead to new and potentially more efficient offline PBRL algorithms. We empirically test our framework on preference datasets based on the standard D4RL benchmark. When combined with a variety of efficient reward-based offline RL algorithms, the learning result achieved under our framework is comparable to training the same algorithm on the dataset with actual rewards in many cases and better than the recent PBRL baselines in most cases.
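
The two-stage pipeline described above (label rewards from preferences, then hand the labeled dataset to a reward-based offline RL algorithm) can be illustrated with a minimal sketch. It is not the ORL procedure itself, which the abstract does not specify. As a stand-in it assumes a linear reward model fitted with a Bradley-Terry logistic likelihood on segment pairs, and all names (`fit_linear_reward`, `label_rewards`, the synthetic features) are hypothetical.

```python
import numpy as np


def fit_linear_reward(feats_a, feats_b, prefs, lr=0.5, steps=500):
    """Fit weights w of an assumed linear reward r(s, a) = w . phi(s, a).

    feats_a, feats_b: (n_pairs, d) features summed over each segment.
    prefs: (n_pairs,) with 1.0 if segment a is preferred, else 0.0.
    Uses plain gradient descent on the Bradley-Terry negative log-likelihood.
    """
    diff = feats_a - feats_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        # Modeled probability that segment a is preferred over segment b.
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))
        grad = diff.T @ (p - prefs) / len(prefs)
        w -= lr * grad
    return w


def label_rewards(transition_feats, w):
    """Assign a scalar reward label to every transition in the dataset."""
    return transition_feats @ w


# Synthetic demo data (assumption: preferences follow a hidden linear reward).
rng = np.random.default_rng(0)
n_pairs, seg_len, d = 200, 5, 3
w_true = np.array([1.0, -2.0, 0.5])
seg_a = rng.normal(size=(n_pairs, seg_len, d))
seg_b = rng.normal(size=(n_pairs, seg_len, d))
feats_a, feats_b = seg_a.sum(axis=1), seg_b.sum(axis=1)
prefs = (feats_a @ w_true > feats_b @ w_true).astype(float)

w_hat = fit_linear_reward(feats_a, feats_b, prefs)

# Label every transition; the result can be passed to any reward-based
# offline RL algorithm in place of the missing true rewards.
transitions = np.concatenate([seg_a, seg_b]).reshape(-1, d)
rewards = label_rewards(transitions, w_hat)
```

Under these assumptions, the labeled `rewards` array plays the role of the scalar reward column in a standard offline RL dataset, so the downstream algorithm needs no preference-specific changes.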
