Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

  • 2025-10-21 15:11:01
  • Joongkyu Lee, Seouh-won Yi, Min-hwan Oh

Abstract

We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{\sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
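
To make the two quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It evaluates the standard Plackett-Luce probability of a ranking given per-action scores, and the rate $\frac{d}{T}\sqrt{\sum_{t=1}^T 1/|S_t|}$ with constants and logarithmic factors dropped. The function names `pl_ranking_prob` and `gap_rate`, and the use of plain scalar scores in place of the paper's reward parametrization, are assumptions made here for illustration.

```python
import math


def pl_ranking_prob(scores, ranking):
    """Probability of `ranking` (best first) under a Plackett-Luce model.

    `scores` maps each action to a real-valued score (hypothetical input;
    the paper's exact reward model is not specified in the abstract).
    """
    prob = 1.0
    remaining = list(ranking)
    while remaining:
        # The top remaining action is picked with softmax probability
        # over the actions that have not been ranked yet.
        denom = sum(math.exp(scores[a]) for a in remaining)
        prob *= math.exp(scores[remaining[0]]) / denom
        remaining.pop(0)
    return prob


def gap_rate(d, subset_sizes):
    """(d / T) * sqrt(sum_t 1/|S_t|), ignoring constants and log factors."""
    T = len(subset_sizes)
    return (d / T) * math.sqrt(sum(1.0 / k for k in subset_sizes))


if __name__ == "__main__":
    # Three equally scored actions: each of the 6 rankings is equally likely.
    print(pl_ranking_prob({0: 0.0, 1: 0.0, 2: 0.0}, [0, 1, 2]))
    # Fixed subset size K for all T rounds gives d / sqrt(K * T).
    print(gap_rate(10, [2] * 100), gap_rate(10, [4] * 100))
```

With a fixed subset size $|S_t| = K$ in every round, the sum equals $T/K$ and the rate simplifies to $d/\sqrt{KT}$, so moving from pairwise feedback ($K = 2$) to $K = 4$ shrinks this quantity by a factor of $\sqrt{2}$.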

 
