Abstract
Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.
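
As an illustration of the idea of applying PCA to embedding differences, the following is a minimal sketch rather than the implementation used in this work. It assumes that embeddings of preferred and rejected responses are already available as NumPy arrays; the function names (`decompose_preferences`, `combined_reward`), the mean-centering step, the random placeholder data, and the example weights are all hypothetical choices made for illustration.

```python
import numpy as np

def decompose_preferences(chosen_emb, rejected_emb, n_components):
    """Run PCA on (preferred - rejected) embedding differences.

    Returns the top principal directions (rows are orthonormal) and the
    variance explained by each. Centering is the standard PCA step and is
    an assumption of this sketch.
    """
    diffs = chosen_emb - rejected_emb
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(diffs) - 1)
    return components, explained_variance

def combined_reward(emb, components, weights):
    """Score responses as a weighted sum of per-direction rewards."""
    per_direction = emb @ components.T  # one reward per basis vector
    return per_direction @ weights

# Placeholder data: 200 comparison pairs with 16-dimensional embeddings.
rng = np.random.default_rng(0)
chosen = rng.normal(size=(200, 16))
rejected = rng.normal(size=(200, 16))

components, variance = decompose_preferences(chosen, rejected, n_components=4)

# Hypothetical user-specific weights over the four decomposed rewards.
weights = np.array([0.5, 0.3, 0.2, 0.0])
scores = combined_reward(chosen, components, weights)
```

In this sketch, adapting to a different user only changes `weights`; the basis vectors are computed once and reused. Note that the sign of each principal direction is arbitrary, so a practical system would need some way to orient each direction before interpreting it as a reward.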