Abstract
Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach for aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model and focus on training it to correct misranked preference pairs. However, recent work~\citep{chen2024preference} empirically finds that DPO training \textit{rarely improves these misranked preference pairs}, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead \textit{down-weights} misranked preference pairs and prioritizes enhancing the model's understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor that dynamically scales the DPO loss. Our experiments demonstrate that FocalPO surpasses DPO and its variants on popular benchmarks such as AlpacaEval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B, with the introduced hyperparameter held fixed. Additionally, we empirically show how FocalPO affects training on correctly and incorrectly ranked sample groups, further underscoring its effectiveness.
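To make the idea of a modulating factor concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the factor takes the form p^gamma, where p = sigmoid(margin) is the model's implicit probability of ranking the pair correctly; the abstract does not specify the exact form, so this choice, the function names, and the value of `gamma` are all illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function (not numerically hardened; fine for a sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(margin: float) -> float:
    """Per-pair DPO loss, -log sigmoid(margin).

    `margin` is beta times the difference between the implicit rewards
    of the chosen and rejected responses. A negative margin means the
    model currently misranks the pair.
    """
    return -math.log(sigmoid(margin))


def focal_style_loss(margin: float, gamma: float) -> float:
    """DPO loss scaled by an ASSUMED modulating factor p**gamma.

    Because p is small for misranked pairs, the factor shrinks their
    loss more than it shrinks the loss of correctly ranked pairs.
    gamma = 0 recovers plain DPO. The exact factor used by FocalPO is
    not given in the abstract; p**gamma is a hypothetical stand-in.
    """
    p = sigmoid(margin)
    return -(p ** gamma) * math.log(p)


if __name__ == "__main__":
    gamma = 1.0  # illustrative value only
    for margin in (-2.0, 2.0):  # misranked pair, then correctly ranked pair
        base = dpo_loss(margin)
        scaled = focal_style_loss(margin, gamma)
        print(f"margin={margin:+.1f}  dpo={base:.4f}  "
              f"scaled={scaled:.4f}  factor={scaled / base:.4f}")
```

Under this assumed form, the ratio of the scaled loss to the DPO loss equals p^gamma, so it is smaller for the misranked pair than for the correctly ranked one, which is the down-weighting behavior described above. This is the opposite direction from the original Focal Loss factor (1 - p)^gamma, which emphasizes hard examples.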