Abstract
Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.
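To make the two headline quantities concrete, the following is a minimal sketch under stated assumptions, not the paper's evaluation code. The abstract does not give formal definitions, so the sketch assumes that skew is the fraction of counterfactual pairs in which the preference model picks the bias-magnified response, and that miscalibration is the fraction of pairs in which the model's choice disagrees with the human choice. The names `PairJudgement`, `skew_rate`, and `miscalibration_rate`, and the toy data, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairJudgement:
    """One counterfactual pair: a base response vs. a bias-magnified variant.

    Hypothetical container; the field names are assumptions for illustration.
    """
    model_prefers_biased: bool  # preference model picked the bias-magnified response
    human_prefers_biased: bool  # human annotators picked the bias-magnified response


def skew_rate(pairs: Sequence[PairJudgement]) -> float:
    """Assumed definition: share of pairs where the model favors the biased response."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(p.model_prefers_biased for p in pairs) / len(pairs)


def miscalibration_rate(pairs: Sequence[PairJudgement]) -> float:
    """Assumed definition: share of pairs where model and human choices disagree."""
    if not pairs:
        raise ValueError("need at least one pair")
    disagreements = sum(
        p.model_prefers_biased != p.human_prefers_biased for p in pairs
    )
    return disagreements / len(pairs)


# Toy data only; these judgements are made up and are not results from the paper.
toy_pairs = [
    PairJudgement(model_prefers_biased=True, human_prefers_biased=False),
    PairJudgement(model_prefers_biased=True, human_prefers_biased=True),
    PairJudgement(model_prefers_biased=True, human_prefers_biased=False),
    PairJudgement(model_prefers_biased=False, human_prefers_biased=False),
    PairJudgement(model_prefers_biased=False, human_prefers_biased=True),
]

toy_skew = skew_rate(toy_pairs)
toy_miscalibration = miscalibration_rate(toy_pairs)
```

Under these assumed definitions, a skew rate near 50% would indicate no systematic preference for the bias-magnified response, and comparing the rate before and after finetuning shows whether debiasing moved the model toward human judgements.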