Abstract
Reinforcement learning from human feedback (RLHF) has contributed to performance improvements in large language models. To reduce its reliance on substantial amounts of human-labeled data, a successful approach is multi-task representation learning, which learns a high-quality, low-dimensional representation from a wide range of source tasks. In this paper, we formulate RLHF as a contextual dueling bandit problem and assume a common linear representation. We demonstrate that the sample complexity of source tasks in multi-task RLHF can be reduced by considering task relevance and allocating different sample sizes to source tasks according to their task relevance. We further propose an algorithm that estimates task relevance from a small amount of additional data and then learns a policy. We prove that, to achieve an $\varepsilon$-optimal policy, the sample complexity of the source tasks can be significantly reduced compared to uniform sampling. Additionally, the sample complexity of the target task is only linear in the dimension of the latent space, thanks to representation learning.