50 Years of Test (Un)fairness: Lessons for Machine Learning

  • 2018-11-25 21:48:19
  • Ben Hutchinson, Margaret Mitchell

Abstract

Quantitative definitions of what is unfair and what is fair have been introduced in multiple disciplines for well over 50 years, including in education, hiring, and machine learning. We trace how the notion of fairness has been defined within the testing communities of education and hiring over the past half century, exploring the cultural and social context in which different fairness definitions have emerged. In some cases, earlier definitions of fairness are similar or identical to definitions of fairness in current machine learning research, and foreshadow current formal work. In other cases, insights into what fairness means and how to measure it have largely gone overlooked. We compare past and current notions of fairness along several dimensions, including the fairness criteria, the focus of the criteria (e.g., a test, a model, or its use), the relationship of fairness to individuals, groups, and subgroups, and the mathematical method for measuring fairness (e.g., classification, regression). This work points the way towards future research and measurement of (un)fairness that builds from our modern understanding of fairness while incorporating insights from the past.

 


FAT* ’19: Conference on Fairness, Accountability, and Transparency, January 29–31, 2019, Atlanta, GA, USA. DOI: 10.1145/3287560.3287600. ISBN: 978-1-4503-6125-5/19/01.

1. Introduction

The United States Civil Rights Act of 1964 effectively outlawed discrimination on the basis of an individual’s race, color, religion, sex, or national origin. The Act contained two important provisions that would fundamentally shape the public’s understanding of what it meant to be unfair, with lasting impact into modern day: Title VI, which prevented government agencies that receive federal funds (including universities) from discriminating on the basis of race, color or national origin; and Title VII, which prevented employers with 15 or more employees from discriminating on the basis of race, color, religion, sex or national origin.

Assessment tests used in public and private industry immediately came under public scrutiny. The question posed by many at the time was whether the tests used to assess ability and fit in education and employment were discriminating on bases forbidden by the new law (Ash, 1966). This stimulated a wealth of research into how to mathematically measure unfair bias and discrimination within the educational and employment testing communities, often with a focus on race. The period of time from 1966 to 1976 in particular gave rise to fairness research with striking parallels to ML fairness research from 2011 until today, including formal notions of fairness based on population subgroups, the realization that some fairness criteria are incompatible with one another, and pushback on quantitative definitions of fairness due to their limitations.

Into the 1970s, there was a shift in perspective, with researchers moving from defining how a test may be unfair to how a test may be fair. It is during this time that we see the introduction of mathematical criteria for fairness identical to the mathematical criteria of modern day. Unfortunately, this fairness movement largely disappeared by the end of the 1970s, as the different and sometimes competing notions of fairness left little room for clarity on when one notion of fairness may be preferable to another. In the words of the retrospective analysis by Nancy Cole (Cole and Zieky, 2001), who in 1973 introduced the equivalent of Hardt et al.’s equality of opportunity (Hardt et al., 2016):

The spurt of research on fairness issues that began in the late 1960s had results that were ultimately disappointing. No generally agreed upon method to determine whether or not a test is fair was developed. No statistic that could unambiguously indicate whether or not an item is fair was identified. There were no broad technical solutions to the issues involved in fairness.

By learning from this past, we hope to avoid such a fate.

Before diving further into the history of testing fairness, it is useful to briefly consider the structural correspondences between tests and ML models. Test items (questions) are analogous to model features, and item responses are analogous to specific activations of those features. Scoring a test is typically a simple linear model which produces a (possibly weighted) sum of the item scores. Sometimes test scores are normalized or standardized so that scores fit a desired range or distribution. Because of this correspondence, much of the math is directly comparable, and many of the underlying ideas in earlier fairness work map trivially onto modern day ML fairness. “History doesn’t repeat itself, but it often rhymes”; and by hearing this rhyme, we hope to gain insight into the future of ML fairness.
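The correspondence between test scoring and a linear model can be made concrete with a minimal sketch. The item weights and responses below are hypothetical values chosen for illustration, and the function names `score_test` and `standardize` are our own, not drawn from any testing library.

```python
from statistics import mean, pstdev


def score_test(item_scores, weights=None):
    """Raw test score: a (possibly weighted) sum of the item scores."""
    if weights is None:
        weights = [1.0] * len(item_scores)
    return sum(w * x for w, x in zip(weights, item_scores))


def standardize(raw_scores):
    """Rescale raw scores to mean 0 and (population) standard deviation 1."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)
    return [(s - mu) / sigma for s in raw_scores]


# Hypothetical item responses (1 = correct, 0 = incorrect), one row per test taker.
responses = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]
weights = [1.0, 2.0, 1.0, 1.0]  # assumed weights: the second item counts double

raw = [score_test(r, weights) for r in responses]
z = standardize(raw)
```

Here the items play the role of features, the weights play the role of model coefficients, and standardization corresponds to rescaling the model output.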

Following terminology of the social sciences, applied statistics, and the notation of (Barocas et al., 2018), we use “demographic variable” to refer to an attribute of individuals such as race, age or gender, denoted by the symbol A. We use “subgroup” to denote a group of individuals defined by a shared value of a demographic variable, e.g., A=a. Y indicates the ground truth or target variable, R denotes a score output by a model or a test, and D denotes a binary decision made using that score. We occasionally make exceptions when referencing original material.

(a) Labels on regression lines indicate which subgroup they fit.
(b) The regression line labeled πc fits both subgroups separately (and hence also their union).
Figure 1. Petersen and Novick’s (Petersen and Novick, 1976) original figures demonstrating fairness criteria. The marginal distributions of test scores and ground truth scores for subgroups π1 and π2 are shown by the axes.

2. History of fairness in testing

2.1. 1960s: Bias and Unfair Discrimination

Concerned with the fairness of tests for black and white students, T. Anne Cleary defined a quantitative measure of test bias for the first time, cast in terms of a formal model for predicting educational outcomes from test scores (Cleary, 1966, 1968):

A test is biased for members of a subgroup of the population if, in the prediction of a criterion for which the test was designed, consistent nonzero errors of prediction are made for members of the subgroup. In other words, the test is biased if the criterion score predicted from the common regression line is consistently too high or too low for members of the subgroup. With this definition of bias, there may be a connotation of “unfair,” particularly if the use of the test produces a prediction that is too low. (Emphasis added.)

According to Cleary’s criterion, the situation depicted in Figure 1(a) is biased for members of subgroup π2 if the regression line π1 is used to predict their ability, since it underpredicts their true ability. For Cleary, the situation depicted in Figure 1(b) is not biased: since data from each of the subgroups produce the same regression line, that line can be used to make predictions for either group.
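Cleary’s definition can be sketched numerically: fit a common regression line to the pooled data, then check whether the prediction errors for each subgroup are consistently nonzero. The data below are hypothetical, and using the mean residual as the summary of "consistent" error is our simplifying assumption rather than Cleary’s exact procedure.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def mean_prediction_error(line, xs, ys):
    """Mean of (actual - predicted); a positive value means underprediction."""
    intercept, slope = line
    return mean(y - (intercept + slope * x) for x, y in zip(xs, ys))


# Hypothetical (test score R, criterion Y) data for two subgroups.
r1, y1 = [1, 2, 3, 4], [1.0, 2.0, 3.0, 4.0]
r2, y2 = [1, 2, 3, 4], [2.0, 3.0, 4.0, 5.0]  # same slope, higher criterion scores

common_line = fit_line(r1 + r2, y1 + y2)  # regression line fit to the pooled data
err_1 = mean_prediction_error(common_line, r1, y1)
err_2 = mean_prediction_error(common_line, r2, y2)
```

In this toy data the common line overpredicts the criterion for the first subgroup (negative mean error) and underpredicts it for the second (positive mean error), so the test would count as biased under Cleary’s definition.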

In addition to defining bias in terms of predictions by regression models, Cleary also performed a study on real-world data from three state-supported and state-subsidized schools, comparing college GPA with SAT scores. Racial data was obtained from an admissions office, from an NAACP list of black students, and from examining class pictures. Cleary used Analysis of Covariance (ANCOVA) to test the relationship of SAT scores and high school rank (HSR) to college GPA. Contrary to some expectations, Cleary found little evidence of the SAT being a biased predictor of GPA. (Later, larger studies found that the SAT overpredicted the GPA of black students (Vars and Bowen, 1998); it may be that the SAT is biased but less so than the GPA.)

While Cleary’s focus was on education, her contemporary Robert Guion was concerned with unfair discrimination in employment. Arguing for the importance of quantitative analyses in 1966, he wrote that: “Illegal discrimination is largely an ethical matter, but the fulfillment of ethical responsibility begins with technical competence” (Guion, 1966), and defined unfair discrimination to be “when persons with equal probabilities of success on the job have unequal probabilities of being hired for the job.” However, Guion recognized the challenges in using constructs such as the probability of success. We can observe actual success and failure after selection, but the probability of success is not itself observable, and a sophisticated model is required to estimate it at the time of selection.

By the end of the 1960s, there was political and legal support backing concerns with the unfairness of the educational system for black children and the unfairness of tests purporting to measure black intellectual competence. Responding to these concerns, the Association of Black Psychologists, formed in 1969, immediately published “A Petition of Concerns”, calling for a moratorium on standardized tests “(which are used) to maintain and justify the practice of systematically denying economic opportunities” (Williams et al., 1980). The NAACP followed up on this in 1974 by adopting a resolution that demanded “a moratorium on standardized testing wherever such tests have not been corrected for cultural bias” (cited by (Samuda, 1998)). Meanwhile, advocates of testing worried that alternatives to testing such as interviews would introduce more subjective bias (Flaugher, 1974).¹

¹ For example, the origins of the college entrance essay are rooted in Ivy League universities’ covert attempts to suppress the numbers of Jewish students, whose performance on entrance exams had led them to become an increasing percentage of the student population (Karabel, 2006).

2.2. 1970s: Fairness

As the 1960s turned to the 1970s, work began to arise that parallels the recent evolution of work in ML fairness, marking a change in framing from unfairness to fairness. Following Thorndike (Thorndike, 1971), “The discussion of ‘fairness’ in what has gone before is clearly over-simplified. In particular, it has been based upon the premise that the available criterion score is a perfectly relevant, reliable and unbiased measure…” Thorndike’s sentiment was shared by other academics of the time, who, in examining the earlier work of Cleary, objected that it failed to take into account the differing false positive and false negative rates that occur when subgroups have different base rates (i.e., A is not independent of Y) (Thorndike, 1971; Einhorn and Bass, 1971).

With the goal of moving beyond simplified models, Thorndike (Thorndike, 1971) proposed one of the first quantitative criteria for measuring test fairness. With this shift, Thorndike advocated for considering the contextual use of a test:

A judgment on test-fairness must rest on the inferences that are made from the test rather than on a comparison of mean scores in the two populations. One must then focus attention on fair use of the test scores, rather than on the scores themselves.

Contrary to Cleary, Thorndike argued that sharing a common regression line is not important, as one can achieve fair selection goals by using different regression lines and different selection thresholds for the two groups.

As an alternative to Cleary, Thorndike proposed that the ratio of predicted positives to ground truth positives be equal for each group. Using confusion matrix terminology, this is equivalent to requiring that the ratio (TP+FP)/(TP+FN) be equal for each subgroup. According to Thorndike, the situation in Figure 1(a) is fair for test cutoff x*. Figure 1(b) is unfair using any single threshold, but fair if threshold x1* is used for group π1 and threshold x2* is used for group π2.
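Thorndike’s ratio is straightforward to compute per subgroup. The following is a minimal sketch on hypothetical binary data; the helper names are ours, and the sketch assumes every subgroup contains at least one ground truth positive.

```python
def thorndike_ratio(y_true, decisions):
    """(TP + FP) / (TP + FN): predicted positives over ground truth positives."""
    predicted_positives = sum(decisions)  # TP + FP
    actual_positives = sum(y_true)        # TP + FN
    return predicted_positives / actual_positives


def ratios_by_group(groups, y_true, decisions):
    """Thorndike's ratio computed separately for each subgroup."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, a in enumerate(groups) if a == g]
        out[g] = thorndike_ratio([y_true[i] for i in idx],
                                 [decisions[i] for i in idx])
    return out


# Hypothetical data: subgroup label A, ground truth Y, and binary decision D.
groups    = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_true    = [1, 1, 0, 0, 1, 0, 0, 0]
decisions = [1, 1, 0, 0, 1, 1, 0, 0]

ratios = ratios_by_group(groups, y_true, decisions)
```

In this example subgroup "a" has a ratio of 1.0 and subgroup "b" a ratio of 2.0, so the selection rule would not satisfy Thorndike’s criterion; equalizing the ratios would require a different decision threshold for at least one subgroup.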

Similar to modern day ML fairness, e.g., Friedler et al. in 2016 (Friedler et al., 2016), Thorndike also pointed out the tension between individual notions of fairness and group notions of fairness: “the two definitions of fairness—one based on predicted criterion score for individuals and the other on the distribution of criterion scores in the two groups—will always be in conflict.” The conflict was also raised by others in the period, including Sawyer et al. (Sawyer et al., 1976), in a foreshadowing of the COMPAS debate of 2016:

A conflict arises because the success maximization procedures based on individual parity do not produce equal opportunity (equal selection for equal success) based on group parity and the opportunity procedures do not produce success maximization (equal treatment for equal prediction) based on individual parity.

Almost as an aside, Thorndike mentions the existence of another regression line ignored by Cleary: the line that estimates the value of the test score R given the target variable Y. This idea hints at the notion of equal opportunity for those with a given value of Y, an idea which soon was picked up by Darlington (Darlington, 1971) and Cole (Cole, 1973).

At a glance, Cleary’s and Thorndike’s definitions are difficult to compare directly because of the different ways in which they are defined. Darlington (Darlington, 1971) helped to shed light on the relationship between Cleary’s and Thorndike’s conceptions of fairness by expressing them in a common formalism. He defines four fairness criteria in terms of the correlation $\rho_{AR}$ between the demographic variable and the test score. Following Darlington,

  (1) Cleary’s criterion can be restated in terms of correlations of the “culture variable” with test scores. If Cleary’s criterion holds for every subgroup, then $\rho_{AR} = \rho_{AY} / \rho_{RY}$² (Vargha et al., 1996).

  (2) Similarly, Thorndike’s criterion is equivalent to requiring that $\rho_{AR} = \rho_{AY}$.

  (3) The criterion $\rho_{AR} = \rho_{AY} \times \rho_{RY}$ is motivated by thinking about R as a dependent variable affected by independent variables A and Y. If A has no direct effect on R once Y is taken into account then we have a zero partial correlation, i.e. $\rho_{AR \cdot Y} = 0$.³

  (4) An alternative “starkly simple” criterion of $\rho_{AR} = 0$ (recognizable as modern day demographic parity (Dwork et al., 2012)) is introduced but not dwelt on.

² Although Darlington does not mention this additional constraint, we believe the criterion only holds if A, R and Y have a multivariate normal distribution.

³ See footnote 2.

Darlington’s mapping of Cleary’s and Thorndike’s criteria lets him prove that they are incompatible except in the special cases where the test perfectly predicts the target variable ($\rho_{RY} = 1$), or where the target variable is uncorrelated with the demographic variable ($\rho_{AY} = 0$). Figure 2, reproduced from Darlington’s 1971 work, shows that, for any given non-zero correlation between the demographic and target variables, definitions (1), (2), and (3) converge as the correlation between the test score and the target variable approaches 1. When the test correlates only poorly with the target variable, there may be no fair solution under definition (1), because the required value $\rho_{AY} / \rho_{RY}$ can exceed 1.
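The link between the partial-correlation statements and definitions (1) and (3), and the incompatibility of definitions (1) and (2), can be worked through with the standard formula for a first-order partial correlation. This derivation is our own sketch rather than Darlington’s presentation; it assumes $\rho_{RY} \neq 0$ and that no pairwise correlation equals $\pm 1$ where it appears in a denominator.

```latex
% Definition (1): zero partial correlation of A and Y given R
\rho_{AY \cdot R}
  = \frac{\rho_{AY} - \rho_{AR}\,\rho_{RY}}
         {\sqrt{(1 - \rho_{AR}^2)(1 - \rho_{RY}^2)}} = 0
  \iff \rho_{AR} = \frac{\rho_{AY}}{\rho_{RY}}

% Definition (3): zero partial correlation of A and R given Y
\rho_{AR \cdot Y}
  = \frac{\rho_{AR} - \rho_{AY}\,\rho_{RY}}
         {\sqrt{(1 - \rho_{AY}^2)(1 - \rho_{RY}^2)}} = 0
  \iff \rho_{AR} = \rho_{AY}\,\rho_{RY}

% Definitions (1) and (2) hold together only in the special cases
\frac{\rho_{AY}}{\rho_{RY}} = \rho_{AY}
  \iff \rho_{AY}\,(1 - \rho_{RY}) = 0
  \iff \rho_{AY} = 0 \ \text{ or } \ \rho_{RY} = 1
```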

Figure 2. Darlington’s original graph of fair values of the correlation between culture and test score ($r_{CX}$ in Darlington’s notation), plotted against the correlation between test score and ground truth ($r_{XY}$), according to his definitions (1–4). (The correlation between the demographic and target variables is assumed here to be fixed at 0.2.)

Figure 2 enables a range of further observations. According to definition (1), for a given correlation between demographic and target variables, the lower the correlation of the test with the target variable, the higher it is allowed to correlate with the demographic variable and still be considered fair. Definition (3), on the other hand, is the opposite, in that the lower the correlation of the test with the target variable, the lower too must be the test’s correlation with the demographic variable. Darlington’s criterion (2) is the geometric mean of criteria (1) and (3): “a compromise position midway between [the] two… however, a compromise may end up satisfying nobody; psychometricians are not in the habit of agreeing on important definitions or theorems by compromise.” Darlington shows that definition (3) is the only one of the four whose errors are uncorrelated with the demographic variable, where by “errors”, he means errors in the regression task of estimating Y from R.
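These observations can be checked numerically. The sketch below computes the "fair" value of the correlation between demographic variable and test score under each of Darlington’s four definitions; the function name is ours, and the demographic/target correlation of 0.2 follows the value assumed in Figure 2. The test/target correlations of 0.5 and 0.1 are arbitrary illustrative choices.

```python
import math


def darlington_fair_rho_ar(rho_ay, rho_ry):
    """Value of rho_AR deemed fair under Darlington's definitions (1)-(4).

    Returns None for definition (1) when the required value falls outside
    [-1, 1], i.e. when no correlation can satisfy the criterion.
    """
    d1 = rho_ay / rho_ry
    return {
        1: d1 if abs(d1) <= 1 else None,
        2: rho_ay,
        3: rho_ay * rho_ry,
        4: 0.0,
    }


fair = darlington_fair_rho_ar(rho_ay=0.2, rho_ry=0.5)

# Definition (2) is the geometric mean of definitions (1) and (3).
geo_mean = math.sqrt(fair[1] * fair[3])

# With a weakly predictive test, definition (1) has no attainable solution.
infeasible = darlington_fair_rho_ar(rho_ay=0.2, rho_ry=0.1)[1]
```

With a test/target correlation of 0.5, definition (1) permits a correlation of 0.4 with the demographic variable while definition (3) permits only 0.1, which illustrates the opposite directions described above.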

In 1973, Cole (Cole, 1973) continued exploring ideas of equal outcomes across subgroups, defining fairness as all subgroups having the same True Positive Rate (TPR), recognizable as modern day equality of opportunity (Hardt et al., 2016). That same year, Linn (Linn, 1973) introduced (but did not advocate for) equal Positive Predictive Value (PPV) as a fairness criterion, recognizable as modern day predictive parity (Chouldechova, 2017).⁴

⁴ Although he cites (Guion, 1966) and (Einhorn and Bass, 1971), a seeming misattribution, as pointed out by (Petersen and Novick, 1976).

Under Cleary and Darlington’s conceptions, bias or (un)fairness is a property of the test itself. This is contrary to Thorndike, Linn and Cole, who take fairness to be a property of the use of a test. The latter group tended to assume that a test is static, and focused on optimizing its use; whereas Cleary’s concerns were with how to improve the tests themselves. Cleary worked for Educational Testing Service, and one can imagine a test being designed to allow for a range of use cases, since it may not be knowable in advance either i) the precise populations on which it will be deployed, or ii) the number of students to whom an institution deploying the test is able to offer places.

By March 1976, the interest in fairness in the educational testing community was so strong that an entire issue of the Journal of Educational Measurement was devoted to the topic (NCME, 1976), including a lengthy lead article by Petersen and Novick (Petersen and Novick, 1976), in which they consider for the first time the equality of True Negative Rates (TNR) across subgroups, and equal TPR / equal TNR across subgroups (modern day equalized odds (Hardt et al., 2016)). Similarly, they consider the case of equal PPV and equal NPV across subgroups.⁵

⁵ They do not advocate for either combination (neither equal TPR and TNR, nor equal PPV and NPV) on the grounds that either combination requires unusual circumstances. However there is a flaw in their reasoning. For example, arguing against equal TPR and equal TNR, they claim that this requires equal base rates in the ground truth in addition to equal TPR.
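The subgroup parity criteria of Cole, Linn, and Petersen and Novick all reduce to comparing entries of per-subgroup confusion matrices. A minimal sketch, assuming binary ground truth and binary decisions, with hypothetical data and helper names of our own choosing:

```python
def confusion_rates(y_true, decisions):
    """TPR, TNR, PPV and NPV for binary ground truth and binary decisions."""
    pairs = list(zip(y_true, decisions))
    tp = sum(1 for y, d in pairs if y == 1 and d == 1)
    tn = sum(1 for y, d in pairs if y == 0 and d == 0)
    fp = sum(1 for y, d in pairs if y == 0 and d == 1)
    fn = sum(1 for y, d in pairs if y == 1 and d == 0)

    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {"TPR": ratio(tp, tp + fn), "TNR": ratio(tn, tn + fp),
            "PPV": ratio(tp, tp + fp), "NPV": ratio(tn, tn + fn)}


def parity(by_group, metric, tol=1e-9):
    """True if the metric is equal (within tol) across all subgroups."""
    values = [rates[metric] for rates in by_group.values()]
    return max(values) - min(values) <= tol


# Hypothetical (ground truth, decision) data for two subgroups.
data = {
    "a": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "b": ([1, 1, 0, 0], [1, 0, 1, 0]),
}
by_group = {g: confusion_rates(y, d) for g, (y, d) in data.items()}

equal_tpr = parity(by_group, "TPR")  # Cole (1973): equal TPR
equal_tnr = parity(by_group, "TNR")  # additionally needed for equal TPR / equal TNR
equal_ppv = parity(by_group, "PPV")  # Linn (1973): equal PPV
```

In this toy example the two subgroups have equal TPR but unequal TNR and PPV, so Cole’s criterion holds while the combined TPR/TNR criterion and Linn’s criterion do not.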

Work from the mid-1960s to mid-1970s can be summarized along four distinct categories: individual, non-comparative, subgroup parity, and correlation, defined in Table 1. It should be emphasized that researchers who defined a criterion did not always advocate for it. In particular, Darlington, Linn, Jones, and Petersen and Novick all define criteria purely for the purposes of exploring the space of concepts related to fairness. The technical definitions of fairness from this period are summarized in Table 2.

| Category | Description |
|---|---|
| individual | Fairness criterion defined purely in terms of individuals |
| non-comparative | Fairness criterion for each subgroup does not reference other subgroups |
| subgroup parity | Fairness criterion defined in terms of parity of some value across subgroups |
| correlation | Fairness criterion defined in terms of the correlation of the demographic variable with the model output |

Table 1. Categories of Fairness Criteria

2.3. Mid-1970s: The Fairness Tide Turns

Immediately after the journal issue of 1976, research into quantitative definitions of test fairness seems to have come to a halt. Considering why this happened may offer a valuable lesson for modern day fairness research. The same Cole who in 1973 proposed equality of TPR wrote in 2001 that (Cole and Zieky, 2001):

In short, research over the last 30 or so years has not supplied any analyses to unequivocally indicate fairness or unfairness, nor has it produced clear procedures to avoid unfairness. To make matters worse, the views of fairness of the measurement profession and the views of the general public are often at odds.

Foreshadowing this outcome, statements from researchers in the 1970s indicate an increasing concern with how fairness criteria obscure “the fundamental problem, which is to find some rational basis for providing compensatory treatment for the disadvantaged” (Novick and Petersen, 1976). According to Petersen and Novick, the concepts of culture-fairness and group parity are not viable in practice, leading to models that can sanction the discrimination they seek to rectify (Petersen and Novick, 1976). They argue that fairness should be reconceptualized as a problem in maximizing expected utility (Petersen, 1976), recognizing “high social utility in equalizing opportunity and reducing disadvantage” (Novick and Petersen, 1976).

A related thread of work highlights that different fairness criteria encode different value systems (Hunter and Schmidt, 1976), and that quantitative techniques alone cannot answer the question of which to use. In 1971, Darlington (Darlington, 1971) urges that the concept of “cultural fairness” be replaced by “cultural optimality”, which takes into account a policy-level question concerning the optimum balance between accuracy and cultural factors. In 1974, Thorndike points out that “one’s value system is deeply involved in one’s judgment as to what is ‘fair use’ of a selection device” (Novick and Petersen, 1976), and similarly, in 1976, Linn (Linn, 1976) draws attention to the fact that “Values are implicit in the models. To adequately address issues of values they need to be dealt with explicitly.” Hunter and Schmidt (Hunter and Schmidt, 1976) begin to address this issue by bringing ethical theory to the discussion, relating fairness to theories of individualism and proportional representation. Current work may learn from this point in history by explicitly connecting fairness criteria to different cultural and social values.

| Source | Criterion | Category | Proposition |
|---|---|---|---|
| Guion (1966) | “people with equal probabilities of success on the job have equal probabilities of being hired for the job” | individual | Is the use of the test fair? |
| Cleary (1966) | “a subgroup does not have consistent errors” | non-comparative | Is the test fair to subgroup a? |
| Einhorn and Bass (1971) | $\mathrm{Prob}(Y > y^* \mid R = r_a^*, A = a)$ is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A? |
| Thorndike (1971) | $\mathrm{Prob}(R \geq r_a^* \mid A = a) / \mathrm{Prob}(Y \geq y^* \mid A = a)$ is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A? |
| Darlington (1971) (1) | $\rho_{AR} = \rho_{AY} / \rho_{RY}$ (equivalent to $\rho_{AY \cdot R} = 0$) | correlation | Is the test fair with respect to A? |
| Darlington (1971) (2) | $\rho_{AR} = \rho_{AY}$ | correlation | Is the test fair with respect to A? |
| Darlington (1971) (3) | $\rho_{AR} = \rho_{AY} \times \rho_{RY}$ (equivalent to $\rho_{AR \cdot Y} = 0$) | correlation | Is the test fair with respect to A? |
| Darlington (1971) (4) | $\rho_{AR} = 0$ | correlation | Is the test fair with respect to A? |
| Darlington (1971), culturally optimum | $\rho_{R(Y - kA)}$ is maximized, where $k$ is the subjective value placed on subgroup attribute $A = 1$ | correlation | Does the test produce the optimal outcome w.r.t. A? |
| Cole (1973) | $\mathrm{Prob}(R \geq r_a^* \mid Y \geq y^*, A = a)$ is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A? |
| Linn (1973) | $\mathrm{Prob}(Y \geq y^* \mid R \geq r_a^*, A = a)$ is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A? |
| Jones (1973), mean fair | $E(\hat{Y} \mid a) = E(Y \mid a)$ | non-comparative | Is the test fair to subgroup a? |
| Jones (1973), general standard | a subgroup a has equal representation in the top-n candidates ranked by model score as it has in the top-n candidates ranked by Y, for all n | non-comparative | Is the test fair to subgroup a? |
| Jones (1973), at position n | a subgroup a has equal representation in the top-n candidates ranked by model score as it has in the top-n candidates ranked by Y | non-comparative | Is the use of the test fair to subgroup a? |
| Petersen & Novick (1976), conditional probability and its converse | $\mathrm{Prob}(R \geq r_a^* \mid Y \geq y^*, A = a)$ is constant for all subgroups a, and $\mathrm{Prob}(R < r_a^* \mid Y < y^*, A = a)$ is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A? |
| Petersen & Novick (1976), equal probability and its converse | $\mathrm{Prob}(Y \geq y^* \mid R \geq r_a^*, A = a)$ is constant for all subgroups a, and $\mathrm{Prob}(Y < y^* \mid R < r_a^*, A = a)$ is constant for all subgroups a | subgroup parity | Is the use of the test fair with respect to A? |

Table 2. Early technical definitions of fairness in educational and employment testing. Variables: R is the test score; Y is the target variable; A is the demographic variable. The Proposition column indicates whether fairness is considered a property of the way in which a test is used, or of the test itself. indicates that the criterion is discussed in the appendix.

2.4. 1970s on: Differential Item Functioning

Concurrent with the development of criteria for the fair use of tests, another line of research in the measurement community concerned looking for bias in test questions (“items”). In 1968, Cleary and Hilton (Cleary and Hilton, 1968) used an analysis of variance (ANOVA) design to test the interaction between race, socioeconomic level and test item. About a decade later, Scheuneman (Scheuneman, 1979) introduced the related idea of Differential Item Functioning (DIF): “an item is considered unbiased if, for persons with the same ability in the area being measured, the probability of a correct response on the item is the same regardless of the population group membership of the individual.” That is, if $I = I(q)$ is the variable representing a correct response on question $q$, then by this definition $I$ is unbiased if $A \perp I \mid Y$.

In practice, the best measure of the ability that the item is testing is often the test in which the item is a component (Dorans, 2017):

A major change from focusing primarily on fairness in a domain, where so many factors could spoil the validity effort, to a domain where analyses could be conducted in a relatively simple, less confounded way. … In a DIF analysis, the item is evaluated against something designed to measure a particular construct and something that the test producer controls, namely a test score.

Figure 3 illustrates DIF for a test item.

Figure 3. Original graph from (Dorans and Holland, 1992) illustrating DIF.

DIF became very influential in the education field, and to this day DIF is in the toolbox of test designers. Items displaying a DIF are ideally examined further to identify the cause of bias, and possibly removed from the test (Penfield, 2016).
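The idea behind a DIF analysis can be sketched as follows: match test takers on their total test score (used as the proxy for ability), then compare the proportion answering a given item correctly across groups within each matched score level. This is a naive illustration on hypothetical data, not any of the specific DIF statistics used in practice; the record layout and function names are our own assumptions.

```python
from collections import defaultdict


def item_pass_rates(records, item):
    """Proportion correct on `item` for every (total score, group) cell."""
    counts = defaultdict(lambda: [0, 0])  # (score, group) -> [n_correct, n_total]
    for group, total_score, responses in records:
        cell = counts[(total_score, group)]
        cell[0] += responses[item]
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


def dif_gaps(records, item, focal, reference):
    """Focal-minus-reference pass rate at each score level shared by both groups."""
    rates = item_pass_rates(records, item)
    scores = sorted({score for score, _ in rates})
    return {s: rates[(s, focal)] - rates[(s, reference)]
            for s in scores
            if (s, focal) in rates and (s, reference) in rates}


# Hypothetical records: (group, total test score, per-item responses).
records = [
    ("ref", 2, [1, 1, 0]), ("ref", 2, [1, 0, 1]),
    ("foc", 2, [0, 1, 1]), ("foc", 2, [1, 1, 0]),
    ("ref", 1, [1, 0, 0]), ("foc", 1, [0, 1, 0]),
]

gaps = dif_gaps(records, item=0, focal="foc", reference="ref")
```

Consistently nonzero gaps at matched score levels, as for item 0 in this toy data, are the kind of pattern that would flag an item for further examination.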

2.5. 1980s and beyond

With the start of the 1980s came renewed public debate about the existence of racial differences in general intelligence, and the implications for fair testing, following the publication of the controversial Bias in Mental Testing (Jensen, 1980). Political opponents of group-based considerations in educational and employment practices framed them in terms of “preferential treatment” for minorities and “reverse discrimination” against whites. Despite, or perhaps because of, much public debate, neither Congress nor the courts gave unambiguous answers to the question of how to balance social justice considerations with the historical and legal importance placed on the individual in the United States (Council et al., 1989).

Into the 1980s, courts were asked to rule on many cases involving (un)fairness in educational testing. To give just one example, Zwick and Dorans (Zwick and Dorans, 2016) described the case of Debra P. v. Turlington 1984, in which a lawsuit was filed on behalf of “present and future twelfth grade students who had failed or would fail” a high school graduation test. The initial ruling found that the test perpetuated past discrimination and was in violation of the Civil Rights Act. More examples of court rulings on fairness are given by (Phillips, 2016; Zwick and Dorans, 2016).

By the early 1980s, ideas about fairness were having a widespread influence on U.S. employment practices. In 1981, with no public debate, the United States Employment Service implemented a score-adjustment strategy that was sometimes called “race-norming” (Rice and Baptiste, 1994). Each individual was assigned a percentile ranking within their own ethnic group, rather than within the test-taking population as a whole. By the mid-1980s, race-norming was “a highly controversial issue sparking heated debate.” The debate was settled through legislation, with the 1991 Civil Rights Act banning the practice of race-norming (West-Faulcon, 2011).
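The mechanics of within-group percentile ranking can be illustrated with a minimal sketch. The scores and group labels are hypothetical, and the percentile convention used here (percentage of scores at or below a given score) is an assumption for illustration; the source does not specify the exact formula that was used.

```python
def percentile_rank(score, reference_scores):
    """Percentage of reference scores at or below `score` (assumed convention)."""
    at_or_below = sum(1 for s in reference_scores if s <= score)
    return 100.0 * at_or_below / len(reference_scores)


def within_group_percentiles(scores, groups):
    """Percentile rank of each score relative to its own group only."""
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    return [percentile_rank(s, by_group[g]) for s, g in zip(scores, groups)]


# Hypothetical raw scores and group labels.
scores = [50, 60, 70, 80, 40, 50]
groups = ["x", "x", "x", "x", "y", "y"]

pooled = [percentile_rank(s, scores) for s in scores]  # ranked against everyone
normed = within_group_percentiles(scores, groups)      # ranked within own group
```

The same raw score of 50 receives the same pooled percentile for both test takers, but different within-group percentiles (25 in group "x" and 100 in group "y"), which is the effect that made the practice controversial.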

3. Connections to ML fairness

3.1. Equivalent Notions

Many of the fairness criteria we have reviewed are identical to modern-day fairness definitions. Here is a brief summary of these connections:

  • Petersen and Novick’s “conditional probability and its converse” is equivalent to what in ML fairness is variously called separation (Barocas et al., 2018), equalized odds (Hardt et al., 2016), or conditional procedure accuracy equality (Berk et al., 2017), sometimes expressed as the conditional independence $A \perp D \mid Y$.

  • Similarly, their “equal probability and its converse” is equivalent to what is called sufficiency (Barocas et al., 2018) or conditional use accuracy equality (Berk et al., 2017), $A \perp Y \mid D$.

  • Cole’s 1973 fairness definition is identical to equality of opportunity (Hardt et al., 2016), $A \perp D \mid Y = 1$.

  • Linn’s 1973 definition is equivalent to predictive parity (Chouldechova, 2017), $A \perp Y \mid D = 1$.

  • Darlington’s criterion (1) is equivalent to sufficiency in the special case where A, R and Y have a multivariate Gaussian distribution. This is because for this special case the partial correlation $\rho_{AY \cdot R} = 0$ is equivalent to $A \perp Y \mid R$ (Baba et al., 2004). In general, though, we cannot assume even a one-way implication, since $A \perp Y \mid R$ does not imply $\rho_{AY \cdot R} = 0$ (see (Vargha et al., 1996) for a counterexample).

  • Similarly, Darlington’s criteria (2) and (3) are equivalent to independence and separation only in the special cases of multivariate Gaussian distributions.

  • Darlington’s definition (4) is a relaxation of what is called independence (Barocas et al., 2018) or demographic parity in ML fairness, i.e. $A \perp R$; it is equivalent when A and R have a bivariate Gaussian distribution.

  • Guion’s definition “people with equal probabilities of success on the job have equal probabilities of being hired for the job” is a special case of Dwork’s (Dwork et al., 2012) individual fairness with the presupposition that “probability of success on the job” is a construct that can be meaningfully reasoned about.

The fairness literatures in both ML and testing have also been motivated by causal considerations (Kusner et al., 2017; Hardt et al., 2016). Darlington (Darlington, 1971) motivates his definition (3) on the basis of a causal relationship between Y and R (since the ability being measured affects performance on the test). However, Hunter and Schmidt (Hunter and Schmidt, 1976) have pointed out that in testing scenarios we typically only have a proxy for ability, such as GPA four years later, and it is wrong to draw a causal connection from GPA to the college entrance exam.

Hardt et al. (Hardt et al., 2016) describe the challenge in building causal models, by considering two distinct models and their consequences and concluding that “no test based only on the target labels, the protected attribute and the score would give different indications for the optimal score R* in the two scenarios.” This is remarkably reminiscent of Anastasi (Anastasi, 1961), writing in 1961 about test fairness:

No test can eliminate causality. Nor can a test score, however derived, reveal the origin of the behavior it reflects. If certain environmental factors influence behavior, they will also influence those samples of behavior covered by tests. When we use tests to compare different groups, the only question the tests can answer directly is: “How do these groups differ under existing cultural conditions?”

Both the testing fairness and ML fairness literatures have also paid great attention to impossibility results, such as the distinction between group fairness and individual fairness, and the impossibility of obtaining more than one of separation, sufficiency and independence except under special conditions (Thorndike, 1971; Darlington, 1971; Petersen and Novick, 1976; Barocas et al., 2018; Chouldechova, 2017; Kleinberg et al., 2016).

In addition, we see some striking parallels in the framing of fairness in terms of ethical theories, including explicit advocacy for utilitarian approaches.

  • Petersen and Novick’s utility-based approaches relate to Corbett-Davies et al.’s framing of the cost of fairness (Corbett-Davies et al., 2017).

  • Hunter and Schmidt’s analysis of the value systems underlying fairness criteria is similar in spirit to Friedler et al.’s relation of fairness criteria and different worldviews (Friedler et al., 2016).

3.2. Variable Independence

As briefly mentioned above, modern day ML fairness has categorized fairness definitions in terms of independence of variables, which includes sufficiency and separation (Barocas et al., 2018). Some historical notions of fairness neatly fit into this categorization, but others shed light on further dimensions of fairness criteria. Table 3 summarizes these connections, linking the historical criteria introduced in Section 2 to modern day categories. (Utility-based criteria are omitted, but will be discussed below.)

| Historical criterion | ML fairness criterion | Relationship |
|---|---|---|
| Guion (1966) | individual | relaxation |
| Cleary (1968) | sufficiency | when Cleary’s criterion holds for all subgroups, equivalent when R and Y have a bivariate Gaussian distribution |
| Einhorn and Bass (1971) | sufficiency | both involve probability of Y conditioned on R, but Einhorn and Bass are only concerned with the conditional likelihood at the decision threshold |
| Thorndike (1971) | | |
| Darlington (1971) (1) | sufficiency | equivalent when variables have a multivariate Gaussian distribution |
| Darlington (1971) (2) | | |
| Darlington (1971) (3) | separation | equivalent when variables have a multivariate Gaussian distribution |
| Darlington (1971) (4) | independence | equivalent when variables have a bivariate Gaussian distribution |
| Cole (1973) | separation | relaxation (equivalent to equality of opportunity) |
| Linn (1973) | sufficiency | relaxation (equivalent to predictive parity) |
| Jones (1973) mean fair | | |
| Jones (1973) at position n | | |
| Jones (1973) general criterion | | |
| Petersen and Novick (1976) conditional probability and its converse | separation | equivalent |
| Petersen and Novick (1976) equal probability and its converse | sufficiency | equivalent |

Table 3. Relationships between testing criteria and ML’s independence criteria

We find that non-comparative criteria (discussed by Cleary and Jones) do not map onto any of the independence conditions used in ML fairness. Similarly, Thorndike’s criterion and Darlington’s criterion (2) have no counterparts that we know of. There are conceptual similarities between Jones’ criteria and the constrained ranking problem described by Celis et al. (Celis et al., 2017), and also between Einhorn and Bass’s criterion and concerns about infra-marginality (Simoiu et al., 2017).

For a binary classifier, Thorndike’s 1971 group parity criterion is equivalent to requiring that the ratio of positive predictions to ground truth positives be equal for all subgroups. This ratio has no common name that we could find (unlike e.g., precision, recall, etc.), although (Petersen and Novick, 1976) refer to this as the “Constant Ratio Model”. It is closely related to coverage constraints (Goh et al., 2016), class mass normalization (Zhu et al., 2003) and expectation regularization (Mann and McCallum, 2007). Similar arguments can be made for Darlington’s criterion (2) and Jones’ criteria “at position n” and “general criterion”. When viewed as a model of subgroup quotas (Hunter and Schmidt, 1976), Thorndike’s criterion is reminiscent of fair division in economics.
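The ratio described above is simple to compute for a binary classifier. The sketch below is illustrative only; the function name `constant_ratio` and the data are hypothetical, and it assumes every subgroup has at least one ground truth positive.

```python
def constant_ratio(a, y, d):
    """Ratio of predicted positives to ground-truth positives, per subgroup."""
    predicted, actual = {}, {}
    for group, label, decision in zip(a, y, d):
        predicted[group] = predicted.get(group, 0) + decision
        actual[group] = actual.get(group, 0) + label
    # Subgroups with no ground-truth positives are skipped (ratio undefined).
    return {g: predicted[g] / actual[g] for g in actual if actual[g] > 0}

# Toy data, made up purely for illustration.
a = ["g1"] * 6 + ["g2"] * 6
y = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
ratios = constant_ratio(a, y, d)
print(ratios)
```

On this toy data both subgroups have a ratio of 1.0 even though their true positive rates differ (2/3 versus 1/2), which illustrates that this criterion is distinct from separation.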

3.3. Regression and Correlation

In reviewing the history of fairness in testing, it becomes clear that regression models have played a much larger role there than in the ML community. Similarly, the use of correlation as a fairness criterion is all but absent in the modern ML fairness literature.

Given that correlation of two variables is a weaker criterion than independence, it is reasonable to ask why one might want a fairness criterion defined in terms of correlations. One practical reason is that calculating correlations is a lot easier than estimating independence. Whereas correlation is a descriptive statistic, and so calculating it requires few assumptions, estimating independence requires the use of inferential statistics, which can in general be highly non-trivial (Shah and Peters, 2018).
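To illustrate how little machinery the correlational quantities need, here is a minimal sketch that computes Pearson correlations and a first-order partial correlation in plain Python. The helper names `pearson` and `partial_corr` and the toy data are hypothetical; the partial correlation uses the standard textbook formula and assumes no pair of variables is perfectly correlated.

```python
import math

def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (z - mv) for x, z in zip(u, v))
    var_u = sum((x - mu) ** 2 for x in u)
    var_v = sum((z - mv) ** 2 for z in v)
    return cov / math.sqrt(var_u * var_v)

def partial_corr(u, v, w):
    """First-order partial correlation of u and v, controlling for w."""
    r_uv, r_uw, r_vw = pearson(u, v), pearson(u, w), pearson(v, w)
    return (r_uv - r_uw * r_vw) / math.sqrt((1 - r_uw ** 2) * (1 - r_vw ** 2))

# Toy data, made up purely for illustration:
# a = subgroup indicator, r = model (test) score, y = ground truth.
a = [0, 0, 0, 0, 1, 1, 1, 1]
r = [0.2, 0.4, 0.5, 0.7, 0.3, 0.6, 0.8, 0.9]
y = [0, 0, 1, 1, 0, 1, 1, 1]

rho_ar = pearson(a, r)                 # zero under a correlational reading of independence
rho_ay_given_r = partial_corr(a, y, r)  # partial correlation of A and Y given the score
print(rho_ar, rho_ay_given_r)
```

A zero value of either statistic does not by itself establish the corresponding independence statement; that step needs distributional assumptions such as joint Gaussianity.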

Considering the analogy between model features and test items described in the Introduction, we also know of no ML analogs to Differential Item Functioning. Such analogs might test for bias in model features. Instead, one approach adopted in ML fairness has been the use of adversarial methods to mitigate the effects of features with undesirable correlations with subgroups, e.g., (Beutel et al., 2017; Zhang et al., 2018).

3.4. Model vs. Model Use

Section 2 described how the test literature had competing notions of whether fairness is a property of a test, or of the use of a test. A similar discussion of whether ML models can be judged as fair or unfair independent of a specific use (including a specific model threshold) has been largely implicit or missing in the ML fairness literature. Models are sometimes trained to be “fair” at their default decision threshold (e.g., 0.5), although the use of different thresholds can have a major impact on fairness (Hardt et al., 2016). The ML fairness notion of calibration, i.e., P(Y=1|A=a,R=r)=r for all a and r, can be interpreted to be a property of the model rather than of its use, since it does not depend on the choice of decision threshold.
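A rough empirical check of the calibration condition can be made by binning scores and comparing the mean score with the observed positive rate in each (subgroup, bin) cell. This is only a sketch under stated assumptions: scores lie in [0, 1], equal-width bins are adequate, and the function name `calibration_table` and the data are hypothetical.

```python
from collections import defaultdict

def calibration_table(a, r, y, n_bins=5):
    """Mean score and observed positive rate per (subgroup, score bin)."""
    cells = defaultdict(lambda: [0.0, 0, 0])  # [sum of scores, positives, count]
    for group, score, label in zip(a, r, y):
        b = min(int(score * n_bins), n_bins - 1)  # bin index for a score in [0, 1]
        cell = cells[(group, b)]
        cell[0] += score
        cell[1] += label
        cell[2] += 1
    return {
        key: {"mean_score": s / n, "positive_rate": p / n, "n": n}
        for key, (s, p, n) in cells.items()
    }

# Toy data, made up purely for illustration.
a = ["g1"] * 8 + ["g2"] * 8
r = [0.25] * 4 + [0.75] * 4 + [0.25] * 4 + [0.75] * 4
y = [1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0]
table = calibration_table(a, r, y)
for key in sorted(table):
    print(key, table[key])
```

Nothing in this check refers to a decision threshold, which is consistent with reading calibration as a property of the model rather than of its use.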

3.5. Race and Gender

Some work on practically assessing fairness in ML has tackled the problem of using race as a construct. This echoes concerns in the testing literature that stem back to at least 1966: “one stumbles immediately over the scientific difficulty of establishing clear yardsticks by which people can be classified into convenient racial categories” (Guion, 1966). Recent approaches have used Fitzpatrick skin type or unsupervised clustering to avoid racial categorizations (Buolamwini and Gebru, 2018; Ryu et al., 2018). We note that the testing literature of the 1960s and 1970s frequently uses the phrase “cultural fairness” when referring to parity between blacks and whites. Other than Thomas (Thomas, 1973), the test fairness literature of the 1960s and 1970s was typically concerned with race rather than gender (although gender received attention later, e.g., (Willingham and Cole, 2013)). The role of culture in gender identity and gender presentation has seen less consideration in ML fairness, but gender labels raise ethical concerns (Hoffmann, 2017; Hamidi et al., 2018).

Comparable to modern sentiment about the difficulties of measuring fairness, earlier decisions in the courtroom highlighted the impossibility of properly accounting for all factors that influence inequalities. For example, in 1964, an Illinois Fair Employment Practices Commission (FEPC) examiner found that Motorola had discriminated against Leon Myart, a black American, in his application to work at Motorola as an “analyzer and phaser”. The examiner found that the 5-minute screening test that Myart took did not account for inequalities and environmental factors of culturally deprived groups. The case was appealed to the Illinois Supreme Court, which found that Myart actually passed the test, and so declined to rule on the fairness of the test (Ash, 1966).

4. Fairness Gaps

4.1. Fairness and Unfairness

In mapping out earlier fairness approaches and their relationship to ML fairness, some conceptual gaps emerge. One noticeable gap relates to the difference in framing between fairness and unfairness. In earlier work on test fairness, there was a focus on defining measurements in terms of unfair discrimination and unfair bias, which brought with it the problem of uncovering sources of bias (Cleary and Hilton, 1968). In the 1970s, this developed into framings in terms of fairness, and the introduction of fairness criteria similar or identical to ML fairness criteria known today. However, returning to the idea of unfairness suggests several new areas of inquiry, including quantifying different kinds of unfairness and bias (such as content bias, selection system bias, etc., cf. (Jencks, 1998)), and a shift in focus from outcomes to inputs and processes (Cojuharenco and Patient, 2013). Quantifying types of unfairness may not only add to the problems that machine learning can address, but also accords with realities of sentencing and policing behind much of the fairness research today: Individuals seeking justice do so when they believe that something has been unfair.

4.2. Differential Item Functioning

Another gap that becomes clear from the historical perspective is the lack of an analog to Differential Item Functioning (Section 2.4) in current ML fairness research. DIF was used by education professionals as a motivation for investigating causes of bias, and a modern-day analog might include unfairness interpretability in ML models. A direct analog in ML could be to compare $P(X_i \mid R=r, A=a)$ for different input features $X_i$, model outputs R and subgroups A. For example, when predicting loan repayment, this might involve comparing how income levels differ across subgroups for a given predicted likelihood of repaying the loan.
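One crude way to operationalize such a comparison is to summarize the conditional distribution of a feature by its mean within each (score bin, subgroup) cell and then look at between-group gaps per bin. This is a sketch of one possible reading, not an established DIF procedure; the function names, the binning scheme and the loan-style data are all hypothetical.

```python
from collections import defaultdict

def feature_by_score_and_group(x, r, a, n_bins=4):
    """Mean of feature x within each (score bin, subgroup) cell."""
    sums = defaultdict(lambda: [0.0, 0])  # [sum of feature values, count]
    for value, score, group in zip(x, r, a):
        b = min(int(score * n_bins), n_bins - 1)  # bin index for a score in [0, 1]
        sums[(b, group)][0] += value
        sums[(b, group)][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def dif_gaps(means):
    """Largest between-group gap in the feature mean, per score bin."""
    by_bin = defaultdict(list)
    for (b, _group), m in means.items():
        by_bin[b].append(m)
    return {b: max(ms) - min(ms) for b, ms in by_bin.items()}

# Toy data, made up purely for illustration: income in thousands,
# predicted repayment likelihood, and subgroup membership.
income = [30, 40, 50, 60, 50, 60, 70, 80]
score = [0.3, 0.3, 0.8, 0.8, 0.3, 0.3, 0.8, 0.8]
group = ["g1"] * 4 + ["g2"] * 4
means = feature_by_score_and_group(income, score, group)
gaps = dif_gaps(means)
print(means)
print(gaps)
```

A large gap at a given score level flags a feature for further investigation; it does not by itself show that the feature is a source of bias.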

4.3. Target Variable / Model Score Relationship

Another gap concerns the ways in which the model (test) score and the target variable are related to each other. In many cases in ML fairness and test fairness, there are correspondences between pairs of criteria which differ only in the roles played by the model (test) score R and the target variable Y. That is, one criterion can be transformed into another by swapping the symbols R and Y; for example, separation can be transformed into sufficiency: $A \perp R \mid Y \;\rightarrow\; A \perp Y \mid R$. In this section we will refer to this type of correspondence as “converse”, i.e., separation is the converse of sufficiency.

When viewed in this light, some asymmetries stand out:

  • Converse Cleary criterion: Cleary’s criterion considers the case of a regression model that predicts a target variable Y given test score R. One could also consider the converse regression model (mentioned in passing by (Thorndike, 1971)), which predicts model score R from ground truth Y, as an instrument for detecting bias. (The Cleary regression model and its converse are distinct except in the special case where the magnitudes of the variables have been standardized.) The converse Cleary condition would say that a test has connotations of “unfair” for a subgroup if the converse regression line has positive errors, i.e., for each given level of ground truth ability, the test score is higher than the converse regression line predicts.

  • Converse calibration: In a regression scenario, the calibration condition P(Y=1|R=r,A=a)=r can be rewritten as E(Y|R=r,A=a)=r, or E(Y-r|R=r,A=a)=0. The converse calibration condition is therefore E(R-y|Y=y,A=a)=0 for all subgroups A=a. In other words, for each subgroup and level of ground truth performance Y=y, the expected error in R’s prediction of the value y is zero.
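The converse calibration condition can be checked empirically by averaging R - y within each (subgroup, ground truth level) cell. The sketch below assumes a regression setting in which Y takes a small number of discrete levels, so that exact matching on y is meaningful; with continuous Y one would need binning or a fitted regression instead. The function name and data are hypothetical.

```python
from collections import defaultdict

def converse_calibration_errors(a, r, y):
    """Mean of (R - y) within each (subgroup, ground-truth value y) cell."""
    cells = defaultdict(lambda: [0.0, 0])  # [sum of errors, count]
    for group, score, target in zip(a, r, y):
        cells[(group, target)][0] += score - target
        cells[(group, target)][1] += 1
    return {key: total / n for key, (total, n) in cells.items()}

# Toy data, made up purely for illustration.
a = ["g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"]
y = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0]
r = [1.5, 0.5, 2.5, 1.5, 1.5, 1.5, 2.0, 2.0]
errors = converse_calibration_errors(a, r, y)
for key in sorted(errors):
    print(key, errors[key])
```

Cells whose mean error is zero satisfy the converse condition at that level; a consistently positive mean for one subgroup corresponds to the “positive errors” case described for the converse Cleary criterion.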

We point out these overlooked concepts not to advocate for their use, but to map out the geography of concepts related to fairness more completely.

4.4. Compromises

Darlington (Darlington, 1971) points out that Thorndike’s criterion is a compromise between one criterion related to sufficiency and one related to separation (see Section 2.2 and Tables 2 and 3). In general, a space of compromises is possible; in terms of correlations, this might be modeled using a parameter λ:

(1) $\rho_{AR} = \rho_{AY} \cdot \rho_{RY}^{\lambda}$

where λ values of -1, 0, and 1 imply Darlington’s definitions (1), (2) and (3), respectively.
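Reading Equation (1) as $\rho_{AR} = \rho_{AY} \cdot \rho_{RY}^{\lambda}$, the three special cases can be written out explicitly. This is only a restatement for convenience, and the correspondence of each line to Darlington's definitions (1), (2) and (3) is taken from the sentence above rather than derived here.

```latex
\begin{align*}
\lambda = -1:&\quad \rho_{AR} = \rho_{AY} / \rho_{RY} \\
\lambda = 0:&\quad \rho_{AR} = \rho_{AY} \\
\lambda = 1:&\quad \rho_{AR} = \rho_{AY}\,\rho_{RY}
\end{align*}
```

Intermediate values of $\lambda$ then interpolate between these conditions.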

This also suggests exploring interpolations between the contrasting sufficiency and separation criteria. For example, one way of parameterizing their interpolation is in terms of binary confusion matrix outcomes.

Definition 4.1. (λ1,λ2)-Thorndikian fairness: A binary classifier satisfies (λ1,λ2)-Thorndikian fairness with respect to demographic variable A if both

  a) $\frac{TP + \lambda_1 FP}{TP + \lambda_2 FN}$ is constant for all values of A, and

  b) $\frac{TN + \lambda_1 FN}{TN + \lambda_2 FP}$ is constant for all values of A.

Note that (1, 0)-Thorndikian fairness is equivalent to sufficiency, while (0, 1)-Thorndikian fairness is equivalent to separation.

Petersen and Novick (Petersen and Novick, 1976) showed that (1,1)-Thorndikian fairness requires that either a) for each subgroup, the positive class is predicted in proportion to its ground truth rate; or b) every subgroup has the same ground truth rate of positives. We can also consider relaxations of (λ1,λ2)-Thorndikian fairness in which only one of the two conditions (a) or (b) is required to hold. For example, only requiring condition (a) gives us a way of parameterizing compromises between equality of opportunity and predictive parity.
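The definition can be made concrete with a short sketch that computes ratios (a) and (b) per subgroup from confusion-matrix counts and checks whether they are constant. The function names, the tolerance and the toy data are hypothetical, and subgroups with a zero denominator are not handled beyond returning NaN.

```python
def thorndikian_ratios(a, y, d, lam1, lam2):
    """Per-subgroup ratios (a) and (b) of (lam1, lam2)-Thorndikian fairness."""
    counts = {}
    for group, label, decision in zip(a, y, d):
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        if decision == 1:
            c["tp" if label == 1 else "fp"] += 1
        else:
            c["fn" if label == 1 else "tn"] += 1
    out = {}
    for group, c in counts.items():
        den_a = c["tp"] + lam2 * c["fn"]
        den_b = c["tn"] + lam2 * c["fp"]
        out[group] = (
            (c["tp"] + lam1 * c["fp"]) / den_a if den_a else float("nan"),
            (c["tn"] + lam1 * c["fn"]) / den_b if den_b else float("nan"),
        )
    return out

def is_constant(values, tol=1e-9):
    values = list(values)
    return max(values) - min(values) <= tol

def satisfies_thorndikian(a, y, d, lam1, lam2, require_b=True, tol=1e-9):
    """Check condition (a), and condition (b) unless require_b is False."""
    ratios = thorndikian_ratios(a, y, d, lam1, lam2)
    ok_a = is_constant((ra for ra, _ in ratios.values()), tol)
    ok_b = is_constant((rb for _, rb in ratios.values()), tol)
    return ok_a and (ok_b or not require_b)

# Toy data, made up purely for illustration.
a = ["g1"] * 6 + ["g2"] * 6
y = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
thorndike_ok = satisfies_thorndikian(a, y, d, 1, 1)
separation_ok = satisfies_thorndikian(a, y, d, 0, 1)
sufficiency_ok = satisfies_thorndikian(a, y, d, 1, 0)
print(thorndike_ok, separation_ok, sufficiency_ok)
```

Setting `require_b=False` gives the relaxation in which only condition (a) must hold. On this toy data the (1, 1) setting is satisfied while the (0, 1) and (1, 0) settings are not.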

Our goal here is not to advocate for this particular model of compromise between separation and sufficiency. Rather, since separation and sufficiency criteria can encode competing interests of different parties, our goal is to suggest that ML fairness consider how to encode notions of compromise, which in some scenarios might relate to the public’s notion of fairness. We propose that the economics literature on fair division might provide some useful ideas, as has also been suggested by (Zafar et al., 2017). However, we do heed Darlington’s (Darlington, 1971) warning that “a compromise may end up satisfying nobody; psychometricians are not in the habit of agreeing on important definitions or theorems by compromise.” This statement may be equally true of ML practitioners.

5. Discussion

This short review of historical connections in fairness suggests several concrete steps forward for future research in ML fairness:

  (1) Developing methods to explain and reduce model unfairness by focusing on the causes of unfairness. To paraphrase Darlington (Darlington, 1971), asking “What can be said about models that discriminate among cultures at various levels?” yields more actionable insights than asking “What is a fair model?” This is related to research on causality in ML fairness (see Section 3.1), but includes examination of full causal pathways, and of processes that interact well before decision time. In other words: What causes the disparities?

  (2) Drawing from earlier insights of Guion (Guion, 1966), Thorndike (Thorndike, 1971), Cole (Cole, 1973), Linn (Linn, 1973), Jones (Jones, 1973), and Petersen and Novick (Petersen and Novick, 1976) to expand fairness criteria to include model context and use.

  (3) Building from earlier insights of 1970s researchers (Darlington, 1971; Hunter and Schmidt, 1976; Linn, 1976) to incorporate quantitative factors for the balance between fairness goals and other goals, such as a value system or a system of ethics. This will likely include clearly articulating assumptions and choices, as recently proposed in (Mitchell et al., 2018).

  (4) Diving more deeply into the question of how subgroups are defined, suggested as early as 1966 (Guion, 1966), including questioning whether subgroups should be treated as discrete categories at all, and how intersectionality can be modeled. This might include, for example, how to quantify fairness along one dimension (e.g., age) conditioned on another dimension (e.g., skin tone), as recent work has begun to address (Kearns et al., 2018; Foulds and Pan, 2018).

6. Conclusions

The spike in interest in test fairness in the 1960s arose during a time of social and political upheaval, with quantitative definitions catalyzed in part by U.S. federal anti-discrimination legislation in the domains of education and employment. The rise of interest in fairness today has corresponded with public interest in the use of machine learning in criminal sentencing and predictive policing, including discussions around COMPAS (Larson et al., 2016; Dieterich et al., 2016; Corbett-Davies et al., 2016) and PredPol (O’Neil, 2016; Ensign et al., 2017). Each era gave rise to its own notions of fairness and relevant subgroups, with overlapping ideas that are similar or identical. In the 1960s and 1970s, the fascination with determining fairness ultimately died out as the work became less tied to the practical needs of society, politics and the law, and more tied to unambiguously identifying fairness.

We conclude by reflecting on what further lessons the history of test fairness may have for the future of ML fairness. Careful attention should be paid to legal and public concerns about fairness. The experiences of the test fairness field suggest that in the coming years, courts may start ruling on the fairness of ML models. If technical definitions of fairness stray too far from the public’s perceptions of fairness, then the political will to use scientific contributions in advance of public policy may be difficult to obtain. Perhaps ML practitioners should cautiously take heed from Cole and Zieky’s (Cole and Zieky, 2001) portrayal of developments in their field:

Members of the public continue to see apparently inappropriate interpretations of test scores and misuses of test results. They see this area as a primary fairness concern. However, the measurement profession has struggled to understand the nature of its responsibility in this area, and has generally not acted strongly against instances of misuse, nor has it acted in concert to attack misuses.

We welcome broader debate on fairness that includes both technical and cultural causes, how the context and use of ML models further influence potential unfairness, and the suitability of the variables used in fairness research for capturing systemic unfairness. We agree with Linn’s (Linn, 1976) argument from 1976 that values encoded by technical definitions should be made explicit. By concretely relating fairness debates to ethical theories and value systems (as done by (Hunter and Schmidt, 1976; Zwick and Dorans, 2016)), we can make discussions more accessible to the general public and to researchers of other disciplines, as well as helping our own ML Fairness community to be more attuned to our own implicit cultural biases.

7. Acknowledgements

Thank you to Moritz Hardt and Shira Mitchell for invaluable conversations and insight.

References

  • Anastasi (1961) Anne Anastasi. 1961. Psychological tests: Uses and abuses. Teachers College Record (1961).
  • Ash (1966) Philip Ash. 1966. The implications of the Civil Rights Act of 1964 for psychological assessment in industry. American Psychologist 21, 8 (1966), 797.
  • Baba et al. (2004) Kunihiro Baba, Ritei Shibata, and Masaaki Sibuya. 2004. Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics 46, 4 (2004), 657–664.
  • Barocas et al. (2018) Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2018. Fairness in Machine Learning. http://fairmlbook.org. (2018).
  • Berk et al. (2017) Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2017. Fairness in criminal justice risk assessments: the state of the art. arXiv preprint arXiv:1703.09207 (2017).
  • Beutel et al. (2017) Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. 2017. Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations. CoRR abs/1707.00075 (2017). arXiv:1707.00075 http://arxiv.org/abs/1707.00075
  • Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. 77–91.
  • Celis et al. (2017) L Elisa Celis, Damian Straszak, and Nisheeth K Vishnoi. 2017. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840 (2017).
  • Chouldechova (2017) Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5, 2 (2017), 153–163.
  • Cleary (1966) T. Anne Cleary. 1966. Test bias: Validity of the Scholastic Aptitude Test for Negro and white students in integrated colleges. ETS Research Bulletin Series 1966, 2 (1966), i–23.
  • Cleary (1968) T. Anne Cleary. 1968. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement 5, 2 (1968), 115–124.
  • Cleary and Hilton (1968) T Anne Cleary and Thomas L Hilton. 1968. An investigation of item bias. Educational and Psychological Measurement 28, 1 (1968), 61–75.
  • Cojuharenco and Patient (2013) Irina Cojuharenco and David Patient. 2013. Workplace fairness versus unfairness: Examining the differential salience of facets of organizational justice. Journal of Occupational and Organizational Psychology 86, 3 (2013), 371–393.
  • Cole (1973) Nancy S Cole. 1973. Bias in selection. Journal of educational measurement 10, 4 (1973), 237–255.
  • Cole and Zieky (2001) Nancy S Cole and Michael J Zieky. 2001. The new faces of fairness. Journal of Educational Measurement 38, 4 (2001), 369–382.
  • Corbett-Davies et al. (2016) Sam Corbett-Davies, Emma Pierson, Avi Feller, and Sharad Goel. 2016. A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear. https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can-an-algorithm-be-racist-our-analysis-is-more-cautious-than-propublicas/. (2016).
  • Corbett-Davies et al. (2017) Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. CoRR abs/1701.08230 (2017). arXiv:1701.08230 http://arxiv.org/abs/1701.08230
  • Council et al. (1989) National Research Council et al. 1989. Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. National Academies Press.
  • Darlington (1971) Richard B Darlington. 1971. Another Look at Cultural Fairness. Journal of Educational Measurement 8, 2 (1971), 71–82.
  • Dieterich et al. (2016) William Dieterich, Christina Mendoza, and Tim Brennan. 2016. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. http://go.volarisgroup.com/rs/430-MBX-989/images/ProPublica_Commentary_Final_070616.pdf. (2016).
  • Dorans (2017) Neil J Dorans. 2017. Contributions to the Quantitative Assessment of Item, Test, and Score Fairness. In Advancing Human Assessment. Springer, 201–230.
  • Dorans and Holland (1992) Neil J Dorans and Paul W Holland. 1992. DIF Detection and Description: Mantel-Haenszel and Standardization. ETS Research Report Series 1992, 1 (1992), i–40.
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS ’12). ACM, New York, NY, USA, 214–226. https://doi.org/10.1145/2090236.2090255
  • Einhorn and Bass (1971) Hillel J Einhorn and Alan R Bass. 1971. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin 75, 4 (1971), 261.
  • Ensign et al. (2017) Danielle Ensign, Sorelle A Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. 2017. Runaway feedback loops in predictive policing. arXiv preprint arXiv:1706.09847 (2017).
  • Flaugher (1974) Ronald L Flaugher. 1974. Bias in Testing: A Review and Discussion. TM Report No. 36. Technical Report. Educational Testing Services.
  • Foulds and Pan (2018) James R. Foulds and Shimei Pan. 2018. An Intersectional Definition of Fairness. CoRR abs/1807.08362 (2018).
  • Friedler et al. (2016) Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2016. On the (im) possibility of fairness. arXiv preprint arXiv:1609.07236 (2016).
  • Goh et al. (2016) Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P Friedlander. 2016. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems. 2415–2423.
  • Guion (1966) Robert M Guion. 1966. Employment tests and discriminatory hiring. Industrial Relations: A Journal of Economy and Society 5, 2 (1966), 20–37.
  • Hamidi et al. (2018) Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M Branham. 2018. Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 8.
  • Hardt et al. (2016) Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 3315–3323. http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf
  • Hoffmann (2017) Anna Lauren Hoffmann. 2017. Data, technology, and gender: Thinking about (and from) trans lives. In Spaces for the Future. Routledge, 15–25.
  • Hunter and Schmidt (1976) John E Hunter and Frank L Schmidt. 1976. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin 83, 6 (1976), 1053.
  • Jencks (1998) Christopher Jencks. 1998. Racial bias in testing. The Black-White test score gap 55 (1998), 84.
  • Jensen (1980) Arthur R Jensen. 1980. Bias in mental testing. (1980).
  • Jones (1973) Marshall B Jones. 1973. Moderated regression and equal opportunity. Educational and Psychological Measurement 33, 3 (1973), 591–602.
  • Karabel (2006) Jerome Karabel. 2006. The chosen: The hidden history of admission and exclusion at Harvard, Yale, and Princeton. Houghton Mifflin Harcourt.
  • Kearns et al. (2018) Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. In ICML.
  • Kleinberg et al. (2016) Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807 (2016).
  • Kusner et al. (2017) Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in Neural Information Processing Systems. 4066–4076.
  • Larson et al. (2016) Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. 2016. How We Analyzed the COMPAS Recidivism Algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm. (2016).
  • Linn (1973) Robert L Linn. 1973. Fair test use in selection. Review of Educational Research 43, 2 (1973), 139–161.
  • Linn (1976) Robert L Linn. 1976. In search of fair selection procedures. Journal of Educational Measurement 13, 1 (1976), 53–58.
  • Mann and McCallum (2007) Gideon S Mann and Andrew McCallum. 2007. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th international conference on Machine learning. ACM, 593–600.
  • Mitchell et al. (2018) Shira Mitchell, Eric Potash, and Solon Barocas. 2018. Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv:1811.07867 (2018).
  • NCME (1976) National Council on Measurement in Education NCME (Ed.). 1976. Journal of Educational Measurement. 13, 1 (1976).
  • Novick and Petersen (1976) Melvin R Novick and Nancy S Petersen. 1976. Towards equalizing educational and employment opportunity. Journal of Educational Measurement 13, 1 (1976), 77–88.
  • O’Neil (2016) Cathy O’Neil. 2016. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books.
  • Penfield (2016) Randall D Penfield. 2016. Fairness in Test Scoring. In Fairness in Educational Assessment and Measurement. Routledge, 71–92.
  • Petersen (1976) Nancy S Petersen. 1976. An expected utility model for “optimal” selection. Journal of Educational Statistics 1, 4 (1976), 333–358.
  • Petersen and Novick (1976) Nancy S Petersen and Melvin R Novick. 1976. An evaluation of some models for culture-fair selection. Journal of Educational Measurement 13, 1 (1976), 3–29.
  • Phillips (2016) S E Phillips. 2016. Legal Aspects of Test Fairness. In Fairness in Educational Assessment and Measurement, Neil J Dorans and Linda L Cook (Eds.). Routledge, 239–268.
  • Rice and Baptiste (1994) Mitchell F Rice and Brad Baptiste. 1994. Race Norming, Validity Generalization, and Employment Testing. Handbook of Public Personnel Administration 58 (1994), 451.
  • Ryu et al. (2018) Hee Jung Ryu, Hartwig Adam, and Margaret Mitchell. 2018. InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity. In Workshop on Fairness, Accountability and Transparency in Machine Learning.
  • Samuda (1998) Ronald J Samuda. 1998. Psychological testing of American minorities: Issues and consequences. Vol. 10. Sage.
  • Sawyer et al. (1976) Richard L Sawyer, Nancy S Cole, and James WL Cole. 1976. Utilities and the issue of fairness in a decision theoretic model for selection. Journal of Educational Measurement 13, 1 (1976), 59–76.
  • Scheuneman (1979) Janice Scheuneman. 1979. A method of assessing bias in test items. Journal of Educational Measurement 16, 3 (1979), 143–152.
  • Shah and Peters (2018) Rajen D Shah and Jonas Peters. 2018. The Hardness of Conditional Independence Testing and the Generalised Covariance Measure. arXiv preprint arXiv:1804.07203 (2018).
  • Simoiu et al. (2017) Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, et al. 2017. The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics 11, 3 (2017), 1193–1216.
  • Thomas (1973) Charles L Thomas. 1973. The Overprediction Phenomenon among Black Collegians: Some Preliminary Considerations. (1973).
  • Thorndike (1971) Robert L Thorndike. 1971. Concepts of culture-fairness. Journal of Educational Measurement 8, 2 (1971), 63–70.
  • Vargha et al. (1996) András Vargha, Tamas Rudas, Harold D Delaney, and Scott E Maxwell. 1996. Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral Statistics 21, 3 (1996), 264–282.
  • Vars and Bowen (1998) Frederick E Vars and William G Bowen. 1998. Scholastic aptitude test scores, race, and academic performance in selective colleges and universities. The Black-White Test Score Gap (1998), 457–479.
  • West-Faulcon (2011) Kimberly West-Faulcon. 2011. Fairness Feuds: Competing Conceptions of Title VII Discriminatory Testing. Wake Forest L. Rev. 46 (2011), 1035.
  • Williams et al. (1980) Robert L Williams, William Dotson, Patricia Don, and Willie S Williams. 1980. The war against testing: A current status report. The Journal of Negro Education 49, 3 (1980), 263–273.
  • Willingham and Cole (2013) Warren W Willingham and Nancy S Cole. 2013. Gender and fair assessment. Routledge.
  • Zafar et al. (2017) Muhammad Bilal Zafar, Isabel Valera, Manuel Rodriguez, Krishna Gummadi, and Adrian Weller. 2017. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems. 229–239.
  • Zhang et al. (2018) Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating Unwanted Biases with Adversarial Learning. (2018).
  • Zhu et al. (2003) Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). 912–919.
  • Zwick and Dorans (2016) Rebecca Zwick and Neil J Dorans. 2016. Philosophical Perspectives on Fairness in Educational Assessment. In Fairness in Educational Assessment and Measurement, Neil J Dorans and Linda L Cook (Eds.). Routledge, 267–281.

Appendix A: Additional definitions of test fairness

This appendix provides some details of fairness definitions included in Table 2 that were not introduced in the text of Section 2.

Einhorn and Bass

In 1971, Einhorn and Bass (Einhorn and Bass, 1971) noted that even if Cleary’s criterion is satisfied, the rates of false positives and false negatives may differ across subgroups because the standard errors of estimate differ between them. That is, differences in variability around the common regression line lead to different false positive and false negative rates. To address this, they proposed a criterion based on achieving an equal false discovery rate, or as they put it, “designated risk”, at the decision boundary. That is, $\mathrm{Prob}(Y > y^{*} \mid R = r_{a}^{*}, A = a)$ is constant for all subgroups $a$.
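As a minimal sketch of how the designated-risk criterion could be operationalized, assume a regression line Y = alpha + beta * R shared by all subgroups (so that Cleary’s criterion holds), normally distributed residuals, and a subgroup-specific residual standard deviation. These modelling choices, the function names, and the numbers below are illustrative assumptions and are not taken from Einhorn and Bass. Under them, the subgroup cutoff that yields a chosen probability p of exceeding y* can be solved in closed form:

```python
from statistics import NormalDist


def success_prob(r, alpha, beta, sigma, y_star):
    """P(Y > y_star | R = r), assuming Y | R=r ~ Normal(alpha + beta * r, sigma**2)."""
    return 1.0 - NormalDist(mu=alpha + beta * r, sigma=sigma).cdf(y_star)


def cutoff_for_risk(p, alpha, beta, sigma, y_star):
    """Score r* at which P(Y > y_star | R = r*) equals p (requires beta != 0)."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile of p
    return (y_star - alpha + sigma * z) / beta


# Hypothetical values: one regression line, two residual spreads.
ALPHA, BETA, Y_STAR = 0.0, 1.0, 0.0
SIGMA = {"a": 1.0, "b": 2.0}

# A single shared cutoff gives a different success probability per subgroup.
shared = {g: success_prob(1.0, ALPHA, BETA, s, Y_STAR) for g, s in SIGMA.items()}

# Subgroup-specific cutoffs equalize the probability at the decision boundary.
cutoffs = {g: cutoff_for_risk(0.8, ALPHA, BETA, s, Y_STAR) for g, s in SIGMA.items()}
at_cutoff = {
    g: success_prob(cutoffs[g], ALPHA, BETA, SIGMA[g], Y_STAR) for g in SIGMA
}
```

In this toy setting the subgroup with the larger residual spread needs a higher cutoff to reach the same probability, which illustrates why a single cutoff on a common regression line does not by itself equalize risk at the boundary.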

Darlington’s “culturally optimum”

Darlington (Darlington, 1971) proposes that the subjective value that one places on test validity (related to accuracy) and diversity can be scenario-specific. He proposes a technique for eliciting these value judgements, leading to a variable $k$ which measures the amount of tradeoff in validity that is acceptable to increase diversity. He proposes that the “culturally optimum” test is one that maximizes $\rho_{X(Y-kC)}$.
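The selection rule can be sketched as follows, reading the quantity being maximized as the Pearson correlation between a candidate test's scores X and the adjusted criterion Y - kC. The numeric coding of C, the toy data, and all function names are hypothetical and chosen only for illustration; how k is elicited is outside the scope of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def darlington_score(x, y, c, k):
    """Correlation between test scores x and the adjusted criterion y - k * c."""
    adjusted = [yi - k * ci for yi, ci in zip(y, c)]
    return pearson(x, adjusted)


def culturally_optimum(tests, y, c, k):
    """Name of the candidate test with the highest correlation to y - k * c."""
    return max(tests, key=lambda name: darlington_score(tests[name], y, c, k))


# Toy data (hypothetical): criterion y, group indicator c, two candidate tests.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
c = [0, 0, 0, 1, 1, 1]
tests = {
    "test_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # tracks y exactly
    "test_b": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],  # tracks y - 3 * c exactly
}

best_when_k_is_0 = culturally_optimum(tests, y, c, k=0)
best_when_k_is_3 = culturally_optimum(tests, y, c, k=3)
```

With k = 0 the rule reduces to choosing the most valid test, while a larger k shifts the choice towards the test that tracks the adjusted criterion, so the preferred test depends on the value judgement encoded in k.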

Jones

In 1973, Jones (Jones, 1973) proposed a “general standard” of fairness that is related to Thorndike’s (and hence also related to quota-based definitions of fairness). In Jones’ criterion, candidates are ranked in descending order both by test score and by ground truth. If an equal proportion of candidates from the subgroup are present in the top n% of both ranked lists, then the test is fair “at position n”. Jones’ “general standard” of fairness requires that this hold for all values of n. Jones assumes a regression model relating test scores to ground truth, and also defines a weaker “mean-fair” criterion for a subgroup that “the group’s average predicted score equals its average performance score on the [ground truth].”
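Jones’ checks can be sketched in a few lines. For simplicity, this illustration indexes positions by the number of top candidates m rather than by a percentage n, and assumes there are no tied scores. The function names, the tolerance, and the toy data are hypothetical.

```python
from math import isclose


def subgroup_count_in_top(values, in_subgroup, m):
    """Number of subgroup members among the m candidates with the highest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sum(1 for i in order[:m] if in_subgroup[i])


def fair_at_position(scores, truth, in_subgroup, m):
    """True if the subgroup is equally represented in the top m of both rankings."""
    by_score = subgroup_count_in_top(scores, in_subgroup, m)
    by_truth = subgroup_count_in_top(truth, in_subgroup, m)
    return by_score == by_truth


def generally_fair(scores, truth, in_subgroup):
    """The general standard: fairness at every position m = 1, ..., N."""
    n_candidates = len(scores)
    return all(
        fair_at_position(scores, truth, in_subgroup, m)
        for m in range(1, n_candidates + 1)
    )


def mean_fair(predicted, truth, in_subgroup, tol=1e-9):
    """Weaker check: subgroup mean prediction equals subgroup mean ground truth."""
    pred = [p for p, s in zip(predicted, in_subgroup) if s]
    true = [t for t, s in zip(truth, in_subgroup) if s]
    return isclose(sum(pred) / len(pred), sum(true) / len(true), abs_tol=tol)


# Toy data (hypothetical): four candidates, two of them in the subgroup.
scores = [90, 80, 70, 60]          # test scores used for ranking
truth = [3.0, 4.0, 1.0, 2.0]       # ground-truth performance
predicted = [2.0, 4.0, 2.0, 2.0]   # regression predictions on the truth scale
in_subgroup = [True, False, True, False]

fair_at_2 = fair_at_position(scores, truth, in_subgroup, 2)
fair_everywhere = generally_fair(scores, truth, in_subgroup)
subgroup_mean_fair = mean_fair(predicted, truth, in_subgroup)
```

In this toy data the test is fair at position 2 but not at position 1, so it fails the general standard even though the subgroup's mean prediction matches its mean performance. This is one way to see why the mean-fair criterion is the weaker of the two.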