Active teacher selection for reinforcement learning from human feedback

Abstract

Reinforcement learning from human feedback (RLHF) enables machine learning systems to learn objectives from human feedback. A core limitation of these systems is their assumption that all feedback comes from a single human teacher, despite querying a range of distinct teachers. We propose the Hidden Utility Bandit (HUB) framework to model differences in teacher rationality, expertise, and costliness, formalizing the problem of learning from multiple teachers. We develop a variety of solution algorithms and apply them to two real-world domains: paper recommendation systems and COVID-19 vaccine testing. We find that the Active Teacher Selection (ATS) algorithm outperforms baseline algorithms by actively selecting when and which teacher to query. The HUB framework and ATS algorithm demonstrate the importance of leveraging differences between teachers to learn accurate reward models, facilitating future research on active teacher selection for robust reward modeling.
