Abstract
Subset selection algorithms are ubiquitous in AI-driven applications,including, online recruiting portals and image search engines, so it isimperative that these tools are not discriminatory on the basis of protectedattributes such as gender or race. Currently, fair subset selection algorithmsassume that the protected attributes are known as part of the dataset. However,protected attributes may be noisy due to errors during data collection or ifthey are imputed (as is often the case in real-world settings). While a widebody of work addresses the effect of noise on the performance of machinelearning algorithms, its effect on fairness remains largely unexamined. We findthat in the presence of noisy protected attributes, in attempting to increasefairness without considering noise, one can, in fact, decrease the fairness ofthe result! Towards addressing this, we consider an existing noise model in which thereis probabilistic information about the protected attributes (e.g., [58, 34, 20,46]), and ask is fair selection possible under noisy conditions? We formulate a``denoised'' selection problem which functions for a large class of fairnessmetrics; given the desired fairness goal, the solution to the denoised problemviolates the goal by at most a small multiplicative amount with highprobability. Although this denoised problem turns out to be NP-hard, we give alinear-programming based approximation algorithm for it. We evaluate thisapproach on both synthetic and real-world datasets. Our empirical results showthat this approach can produce subsets which significantly improve the fairnessmetrics despite the presence of noisy protected attributes, and, compared toprior noise-oblivious approaches, has better Pareto-tradeoffs between utilityand fairness.