KTO: Model Alignment as Prospect Theoretic Optimization

Abstract

Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
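
To make the loss-aversion claim concrete, here is a minimal sketch of the classic value function from Tversky & Kahneman (1992), assuming their median parameter estimates (curvature 0.88, loss-aversion coefficient 2.25) and a reference point of zero. It illustrates the general shape of a Kahneman-Tversky utility only; it is not the KTO loss described in the paper, and the function name `kt_value` is a hypothetical choice for this example.

```python
def kt_value(z, alpha=0.88, lam=2.25):
    """Subjective value of an outcome z relative to a reference point of 0.

    Assumed form (Tversky & Kahneman, 1992):
        v(z) =  z**alpha              if z >= 0
        v(z) = -lam * (-z)**alpha     if z <  0
    alpha < 1 gives diminishing sensitivity; lam > 1 gives loss aversion.
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** alpha)


# A gain and a loss of the same magnitude are not valued symmetrically.
gain = kt_value(100.0)
loss = kt_value(-100.0)
print(f"value of +100: {gain:.2f}")
print(f"value of -100: {loss:.2f}")
print(f"|loss| / gain: {abs(loss) / gain:.2f}")
```

Under these assumed parameters, a loss is weighted 2.25 times as heavily as a gain of the same size, which is the asymmetry the abstract refers to as loss aversion.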
