Listwise Direct Preference Optimization with Multi-Dimensional Preference Mixing

Abstract

Recent alignment methods based on Direct Preference Optimization (DPO) reformulate preference learning as supervised optimization over pairwise comparisons, offering improved efficiency and stability over reinforcement learning from human feedback (RLHF). However, existing DPO-style methods implicitly assume a single fixed preference objective, which limits their ability to model the structured and sometimes conflicting nature of real-world human judgments that span multiple preference dimensions. In this work, we propose Listwise Direct Preference Optimization ($λ$-DPO), a unified framework that simultaneously improves supervision granularity and preference flexibility. Instead of collapsing multi-dimensional preference signals into a single ranking, $λ$-DPO constructs a mixture of listwise preference distributions weighted by a preference vector $λ$ on the probability simplex, enabling a single model to internalize a continuous spectrum of preference trade-offs. To further improve robustness, we introduce a performance-driven stochastic $λ$ scheduler that adaptively samples preference weights based on empirical downstream performance, explicitly mitigating the risks of misspecification inherent to static weighting schemes. We evaluate our method across multiple model families and scales on six widely used benchmarks. Experimental results show the consistent improvement against baselines.

Quick Read (beta)

loading the full paper ...