Abstract
Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions. We: 1) define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) conduct the largest to-date empirical study of the role of variance (vis-à-vis self-consistency and arbitrariness) in fair binary classification; and 4) release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets. Most fair binary classification benchmarks are close-to-fair when taking into account the amount of arbitrariness present in predictions -- before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification.
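To make the first two contributions concrete, the following is a minimal sketch, not the implementation used in the experiments. The abstract does not give a formula, so the sketch assumes that self-consistency for one example is the fraction of distinct pairs of trained models that agree on its predicted label, and that the ensemble abstains when this value falls below a threshold. The function names, the `threshold` value of 0.75, and the `-1` abstention marker are all hypothetical choices made for illustration.

```python
import numpy as np


def self_consistency(votes: np.ndarray) -> np.ndarray:
    """Per-example pairwise agreement among B models (assumed definition).

    votes: integer array of shape (B, n) holding 0/1 predictions from
    B separately trained models on n examples, with B >= 2.
    """
    num_models = votes.shape[0]
    ones = votes.sum(axis=0)      # models predicting 1, per example
    zeros = num_models - ones     # models predicting 0, per example
    # Probability that two distinct models, drawn without replacement,
    # disagree is 2 * ones * zeros / (B * (B - 1)); agreement is its complement.
    return 1.0 - (2.0 * ones * zeros) / (num_models * (num_models - 1))


def predict_or_abstain(votes: np.ndarray, threshold: float = 0.75) -> np.ndarray:
    """Majority vote, replaced by -1 (abstain) where agreement is below threshold."""
    sc = self_consistency(votes)
    # Ties are broken toward class 1 (an arbitrary choice for this sketch).
    majority = (2 * votes.sum(axis=0) >= votes.shape[0]).astype(int)
    return np.where(sc >= threshold, majority, -1)


# Four hypothetical models, three examples: unanimous 1, unanimous 0, split 2-2.
votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
])
sc = self_consistency(votes)
decisions = predict_or_abstain(votes)
```

Under this assumed definition, the two unanimous examples have self-consistency 1 and receive their majority label, while the evenly split example has self-consistency 1/3 and is abstained on. Raising or lowering `threshold` trades coverage against the share of retained predictions that are effectively arbitrary.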