# Fair Regression for Health Care Spending

###### Abstract

The distribution of health care payments to insurance plans has substantial consequences for social policy. Risk adjustment formulas predict spending in health insurance markets in order to provide fair benefits and health care coverage for all enrollees, regardless of their health status. Unfortunately, current risk adjustment formulas are known to undercompensate payments to health insurers for specific groups of enrollees (by underpredicting their spending). Much of the existing algorithmic fairness literature for group fairness to date has focused on classifiers and binary outcomes. To improve risk adjustment formulas for undercompensated groups, we expand on concepts from the statistics, computer science, and health economics literature to develop new fair regression methods for continuous outcomes by building fairness considerations directly into the objective function. We additionally propose a novel measure of fairness while asserting that a suite of metrics is necessary in order to evaluate risk adjustment formulas more fully. Our data application using the IBM MarketScan Research Databases and simulation studies demonstrate that these new fair regression methods may lead to massive improvements in group fairness with only small reductions in overall fit.

Keywords: Constrained regression, Penalized regression, Risk adjustment, Fairness

## 1 Introduction

### 1.1 Risk Adjustment

Risk adjustment is a method for correcting payments to health insurers such that they reflect the cost of their enrollees relative to enrollee health. It is implemented by most federally regulated health insurance markets in the United States, including Medicare Advantage and the individual health insurance Marketplaces created by the Affordable Care Act, to prevent losses to insurers who take on sicker enrollees (Pope et al., 2004, McGuire et al., 2013, Kautter et al., 2014). Current risk adjustment formulas use ordinary least squares (OLS) linear regression to predict health plan payments with select demographic information and diagnosis codes from medical claims. These OLS-based formulas are then typically evaluated with overall measures of statistical fit, such as ${R}^{2}$.

While ${R}^{2}$ is an important benchmark for evaluating global fit, it lacks information on other dimensions. As a result, risk adjustment has been criticized for not incentivizing efficient payment systems, spending, or population health management (Ash and Ellis, 2012, Layton et al., 2017) and for poorly estimating health costs for some groups by underpredicting spending. Underpredicting spending leads to undercompensation to the insurer, and there is evidence that insurers adjust the prescription drugs, services, and providers they cover (i.e., benefit design) to make health plans less attractive for enrollees in undercompensated groups (Shepard, 2016, Carey, 2017, Geruso et al., 2017). Examples of undercompensated groups include enrollees with specific medical conditions, high-cost enrollees, and partial-year enrollees (van Kleef et al., 2013, Montz et al., 2016, Ericson et al., 2017). Recent research has also shown that health plan insurers have the ability to identify undercompensated groups (Jacobs and Sommers, 2015, Geruso et al., 2017, Rose et al., 2017, Withagen-Koster et al., 2018).

### 1.2 Algorithmic Fairness

A typical algorithmic fairness problem in computer science has an outcome $Y$ and input vector $\bm{X}$ that includes a protected class or sensitive attribute $A\subset \bm{X}$. The goal is to create an estimator for the function $f(\bm{X})=Y$ that maps $\bm{X}$ to $Y$, while aiming to ensure that the function is fair for protected class $A$. Excluding $A$ from input $\bm{X}$ is often insufficient as $A$ tends to be correlated with other predictors in $\bm{X}$. Protected class $A$ may be a group defined by race, age, or gender that has been legally protected from discrimination, but in our case it will be a group defined by a health condition.

The most commonly used measures of fairness are based on the notion of group fairness, striving for similarity in predicted outcomes or errors for groups. While group fairness ignores individual violations of fairness that may occur within the protected class, individual fairness instead asks that similar people are treated similarly (Dwork et al., 2012, Zemel et al., 2013). Definitions of group fairness for binary $Y\in \{0,1\}$ (we will consider $Y\in \mathbb{R}$) have been studied extensively, and include statistical parity (Zemel et al., 2013), equalized odds (Hardt et al., 2016), equalized opportunity (Hardt et al., 2016, Kusner et al., 2018), and predictive parity (Chouldechova, 2017). Additional definitions can be found elsewhere (Zliobaite, 2015, Kusner et al., 2018, Mitchell and Shadlen, 2018). There are tradeoffs involved in selecting a fairness metric, and ensuring fairness under one definition does not necessarily guarantee a satisfying outcome under other definitions. It is often impossible to satisfy multiple definitions of fairness simultaneously, and the most appropriate fairness metric for a problem is context dependent (Kleinberg et al., 2016, Chouldechova, 2017, Berk et al., 2017b).

### 1.3 Our Contribution

In this paper, we expand on concepts from statistics, computer science, and health economics, proposing new estimation methods and measures to improve risk adjustment formulas for undercompensated groups. We consider risk adjustment formulas unfair if they incentivize differential treatment for undercompensated groups via benefit design. This has been referred to in the computer science fairness literature as disparate impact (Barocas and Selbst, 2016), which means that, despite the goals of risk adjustment being fair, the formula results in unfair outcomes for undercompensated groups.

Our setting diverges from existing fairness work in three key ways. First, much of the fairness literature in computer science deals with classifiers and binary decision outcomes (Chouldechova and Roth, 2018). We propose new fair regression estimators for continuous outcomes that reduce residual errors for an undercompensated group by building fairness considerations directly into the objective function. Second, common evaluation measures for fairness have not been generalized to the continuous case. As a result, we extend definitions of fairness from the computer science literature for risk adjustment while additionally considering existing measures in health economics. Third, fairness methods are frequently tested on a set of well-studied datasets. Our work explores how fairness can be applied in new ways to health plan risk adjustment data.

## 2 Fairness for Risk Adjustment

Many concepts from the fairness literature exist in the field of health economics albeit under a different name. Methods for addressing fairness are often separated into three categories based on the point in the learning process at which fairness is addressed: the preprocessing, fitting, or postprocessing phase. We briefly synthesize the methods within each of these categories and discuss how they pertain to risk adjustment. Our paper will then focus on the fitting phase.

If the data are inherently biased, then preprocessing techniques are an attractive solution. These methods create fair datasets by transforming or changing the data so that they are no longer biased (Kamiran and Calders, 2009, Zliobaite et al., 2011, Zemel et al., 2013, Calmon et al., 2017, Johndrow and Lum, 2017). In health economics, it has been shown that spending patterns among various groups may be undesirable due to the current plan benefit system, and by using observed spending data, we reinforce these unfair spending patterns. A recent study explored this concept by transferring funds to undercompensated groups in the raw data in order to promote more ideal spending patterns (Bergquist et al., 2018).

One of the most common fitting phase approaches in health economics attempts to fix group undercompensation by adding new variables representative of the groups in the risk adjustment formula (van Kleef et al., 2013). While this is a straightforward idea, it can be problematic if those variables are unavailable, incentivize over- or underutilization of health services, or the risk adjustment formula does not recognize the improvement (Rose and McGuire, 2018). Fitting techniques in fairness include separate formulas for protected classes as well as fairness penalty terms or constraints (Kamishima et al., 2012, Berk et al., 2017a, Zafar et al., 2017a, b, Bechavod and Ligett, 2018, Dwork et al., 2018). We see intersections of these areas in the health economics literature with separate formulas for enrollees with mental health and substance use disorders (MHSUD) (Shrestha et al., 2018, van Kleef et al., 2018) and constrained regression to reduce undercompensation for specific groups (van Kleef et al., 2017). In risk adjustment, separate formulas to predict spending are already used in practice for infants and adults due to known differences in spending patterns. Nonparametric statistical machine learning methods to enhance estimation accuracy in risk adjustment have also been explored for the fitting stage (Rose, 2016, Shrestha et al., 2018, Park and Basu, 2018), but none of these tools are currently deployed in the U.S. health care system.

Postprocessing techniques modify the results after fitting by, for example, creating specific classification thresholds for different groups (Bansal et al., 2014, Hardt et al., 2016, Kleinberg et al., 2018, El Mhamdi et al., 2018). These methods from the fairness literature separate fit from fairness objectives and allow use of the same prediction function for multiple fairness objectives. Reinsurance in health economics, paying insurers for a portion of the costs of high-cost enrollees, can be considered postprocessing in that it reduces undercompensation for high-risk enrollees (McGuire and van Kleef, 2018).

The remainder of this section describes our approach to fair regression estimation. This involves a suite of fairness measures for evaluating new and existing regression tools in an effort to improve risk adjustment formulas for undercompensated groups. While our main goal is to understand whether estimation methods beyond OLS, including those we newly propose, improve risk adjustment, we also wish to focus on interpretability for stakeholders, such as government agencies, insurers, providers, and enrollees. Therefore, constrained and penalized regressions were natural choices to enforce fairness in risk adjustment for undercompensated groups.

### 2.1 Measures

Let $g$ be the set containing all ${n}_{g}$ enrollees in the undercompensated group, indexed by $i$, and ${g}^{c}$ the complement with all ${n}_{c}$ enrollees not in the undercompensated group, indexed by $j$. Overall sample size, $N={n}_{g}+{n}_{c}$, is indexed by $k$. Group undercompensation is a result of large average group residuals in the risk adjustment formula. We define fairness as a function of these residual errors given that many undercompensated groups have substantially higher average health care costs. Thus, enforcing similar predicted outcomes between $g$ and ${g}^{c}$ would be unfair to both.

The measures we consider assume that the data include unbiased $Y$, which may not be the case in practice. Additionally, fairness is frequently assessed for one or two groups, as we also do here. In reality, we are often concerned about fairness for many groups. This requires the ability to define all meaningful groups, which is not always an objective task. We return to this issue in our discussion.

Group Residual Difference. In the fairness literature, one definition for continuous outcomes is that persons with similar $Y$ should have similar predicted outcomes $\widehat{Y}$ regardless of their protected class (Berk et al., 2017a). This relies on a distance function $d$ to ensure that people who are ‘close’ have similar outcomes. We extend this definition for risk adjustment by comparing residuals rather than predicted outcomes for the two groups:

$$\left(\frac{1}{n_{g}n_{c}}\sum_{i\in g,\,j\in g^{c}}d(Y_{i},Y_{j})\left(Y_{i}-\widehat{Y}_{i}-(Y_{j}-\widehat{Y}_{j})\right)\right)^{2}.$$

We refer to this new measure as the group residual difference. However, this measure is not practical to implement at scale in risk adjustment, which often involves millions of enrollees. The group residual difference requires comparing the residual of every enrollee in the undercompensated group to that of every enrollee in the complement group, which amounts to ${n}_{g}{n}_{c}$ comparisons. This scaling issue was also noted in the earlier work that our metric extends (Berk et al., 2017a). We therefore consider an existing related measure below and do not implement the group residual difference. Our group residual difference metric can be useful in settings where $N$ is smaller.
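To make the scaling concern concrete, the following is a minimal sketch of the group residual difference in plain Python. The function name, the toy data, and the distance weight `close_in_spending` (1 when two enrollees' spending is within a fixed width, 0 otherwise) are illustrative assumptions and do not come from the paper; the double loop makes the ${n}_{g}{n}_{c}$ cost explicit.

```python
def group_residual_difference(y, y_hat, in_group, d):
    """Squared, distance-weighted average of pairwise residual gaps between g and g^c."""
    g = [k for k, a in enumerate(in_group) if a]
    gc = [k for k, a in enumerate(in_group) if not a]
    total = 0.0
    for i in g:                      # n_g iterations ...
        r_i = y[i] - y_hat[i]
        for j in gc:                 # ... times n_c iterations
            r_j = y[j] - y_hat[j]
            total += d(y[i], y[j]) * (r_i - r_j)
    return (total / (len(g) * len(gc))) ** 2


# Hypothetical distance weight: 1 when observed spending differs by at most `width`.
def close_in_spending(y_i, y_j, width=1000.0):
    return 1.0 if abs(y_i - y_j) <= width else 0.0


# Toy data (made up for illustration): two enrollees in g, three in the complement.
y = [5000.0, 5200.0, 900.0, 4800.0, 1000.0]
y_hat = [4000.0, 4500.0, 1000.0, 5000.0, 1100.0]
in_group = [True, True, False, False, False]
grd = group_residual_difference(y, y_hat, in_group, close_in_spending)
```

With millions of enrollees the inner loop becomes the bottleneck, which is the practical motivation for the averaged measures that follow.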

Mean Residual Difference. Because we want solutions that are efficient and scalable, comparing average residuals will be more practical for our application. This concept, which has been referred to as the mean residual difference in the computer science literature, aims to enhance fairness by reducing the mean residual to zero (Calders et al., 2013):

$$\frac{1}{n_{g}}\sum_{i\in g}(\widehat{Y}_{i}-Y_{i})-\frac{1}{n_{c}}\sum_{j\in g^{c}}(\widehat{Y}_{j}-Y_{j}).$$

Net Compensation. Net compensation is a related measure from the health economics literature on the same scale as the mean residual difference (Layton et al., 2017). However, it does not contain a term for the mean residual in the complement group:

$$\frac{1}{n_{g}}\sum_{i\in g}(\widehat{Y}_{i}-Y_{i}).$$

Thus, this measure focuses on a reduction in the residuals in $g$ rather than similarity in residuals between the groups. A parallel net compensation measure can be calculated for ${g}^{c}$.

We highlight that we intentionally take the difference ${\widehat{Y}}_{i}-{Y}_{i}$ rather than ${Y}_{i}-{\widehat{Y}}_{i}$ so that undercompensation for those in $g$ corresponds to a negative value of net compensation, in line with previous literature (e.g., Bergquist et al., 2018). The mean residual difference definition above follows the same convention. We do not maintain this ordering for the corresponding estimators in Section 2.2, for two reasons. In net compensation penalized regression, we wish to penalize large undercompensation by adding to the squared error, which calls for ${Y}_{i}-{\widehat{Y}}_{i}$. In mean residual difference penalized regression, the penalty term is squared, which negates the ordering distinction.

Predictive Ratios. Predictive ratios are commonly used to quantify the underpayment for specific groups in risk adjustment (Pope et al., 2004):

$$\frac{\sum_{i\in g}\widehat{Y}_{i}}{\sum_{i\in g}Y_{i}}.$$

Whereas net compensation provides the absolute magnitude of the loss in dollars, predictive ratios provide the relative size of the loss. Predictive ratios can also be created for ${g}^{c}$.

Fair Covariance. Other fairness work creates a measure based on the idea that to be fair, the predicted outcome (or residual error) and protected class must be independent. Using the covariance between the predicted outcome (or residual error) and the protected class as a proxy for independence, that work establishes a fairness measure (Zafar et al., 2017a, b). Because this prior metric assumes outcomes are classified into discrete categories, we extend the definition to define a new measure of fair covariance for residual errors with continuous $Y$. Our measure is given by:

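$$\mathrm{Cov}(A,Y-\widehat{Y})\approx \frac{1}{N}\sum_{k}\left(A_{k}-P(A=1)\right)\left(Y_{k}-\widehat{Y}_{k}\right)\le c^{*},$$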

where $A\in \{0,1\}$ is the random variable indicating membership in $g$ and ${c}^{*}$ is the covariance of the undercompensated group and the OLS residual. This covariance measure allows one to see the empirical signal for systematic undercompensation through residual covariance and can also be scaled by ${c}^{*}$ such that it is bounded between 0 and 1.
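As a short worked identity, offered as a sketch rather than a result stated in the paper: assume the fair covariance is the empirical covariance between the indicator $A$ and the residual $Y-\widehat{Y}$, and write $p={n}_{g}/N$ for the empirical share of the undercompensated group (the symbol $p$ here is our own shorthand for $P(A=1)$, not the coefficient index used later). Splitting the sum by group gives

$$\frac{1}{N}\sum_{k}(A_{k}-p)(Y_{k}-\widehat{Y}_{k})=\frac{1-p}{N}\sum_{i\in g}(Y_{i}-\widehat{Y}_{i})-\frac{p}{N}\sum_{j\in g^{c}}(Y_{j}-\widehat{Y}_{j})=p(1-p)\left[\frac{1}{n_{g}}\sum_{i\in g}(Y_{i}-\widehat{Y}_{i})-\frac{1}{n_{c}}\sum_{j\in g^{c}}(Y_{j}-\widehat{Y}_{j})\right],$$

using ${n}_{g}/N=p$ and ${n}_{c}/N=1-p$. The bracketed term is the negative of the mean residual difference, so under this assumption the fair covariance equals $-p(1-p)$ times the mean residual difference. As a rough check with the values reported in Section 3, a group share of 13.8% and an OLS mean residual difference of -$2,165 give $0.138\times 0.862\times 2165\approx 257.5$, close to the reported fair covariance of 256; the small gap is presumably due to rounding and cross-validation.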

Global Fit. In addition to fairness measures, we also evaluate overall fit with the traditional measure used in risk adjustment, which is ${R}^{2}$:

$$1-\frac{\sum_{k}(Y_{k}-\widehat{Y}_{k})^{2}}{\sum_{k}(Y_{k}-\overline{Y})^{2}},$$

where $\overline{Y}$ is the overall sample mean of $Y$.

Given current policymaker prioritization of global metrics, it is important to compare estimators with both group and overall fit measures to understand the impact on global fit when seeking fairness for undercompensated groups.
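The measures in this section can be computed together from observed spending, predicted spending, and a group indicator. The sketch below uses only the Python standard library; the function name `fairness_suite`, the dictionary keys, and the toy data are illustrative assumptions, and the fair covariance is taken to be the empirical covariance of the group indicator and the residual $Y-\widehat{Y}$, which is an assumption about its exact form.

```python
def fairness_suite(y, y_hat, in_group):
    """Group fairness and global fit measures for outcomes y and predictions y_hat."""
    n = len(y)
    g = [k for k in range(n) if in_group[k]]
    gc = [k for k in range(n) if not in_group[k]]

    def mean(values):
        return sum(values) / len(values)

    # Residuals are y_hat - y, so undercompensation shows up as a negative value.
    net_g = mean([y_hat[k] - y[k] for k in g])
    net_gc = mean([y_hat[k] - y[k] for k in gc])
    p_group = len(g) / n  # empirical P(A = 1)
    y_bar = mean(y)
    sse = sum((y[k] - y_hat[k]) ** 2 for k in range(n))
    sst = sum((y[k] - y_bar) ** 2 for k in range(n))
    return {
        "r2": 1 - sse / sst,
        "predictive_ratio_g": sum(y_hat[k] for k in g) / sum(y[k] for k in g),
        "predictive_ratio_gc": sum(y_hat[k] for k in gc) / sum(y[k] for k in gc),
        "net_compensation_g": net_g,
        "net_compensation_gc": net_gc,
        "mean_residual_difference": net_g - net_gc,
        # Assumed form: empirical covariance of the group indicator and y - y_hat.
        "fair_covariance": mean(
            [((1.0 if in_group[k] else 0.0) - p_group) * (y[k] - y_hat[k]) for k in range(n)]
        ),
    }


# Toy data (made up for illustration): two enrollees in g, four in the complement.
y = [10000.0, 14000.0, 2000.0, 4000.0, 6000.0, 4000.0]
y_hat = [9000.0, 11000.0, 3000.0, 5000.0, 6000.0, 6000.0]
in_group = [True, True, False, False, False, False]
measures = fairness_suite(y, y_hat, in_group)
```

In this toy example the group in $g$ is underpredicted by 2,000 on average while the complement is overpredicted by 1,000, so the mean residual difference is -3,000 even though ${R}^{2}$ is high, which illustrates why a global fit measure alone can hide group undercompensation.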

### 2.2 Estimation Methods

We present five methods that incorporate a fairness objective to improve risk adjustment formulas for undercompensated groups, two of which are new contributions. This is accomplished with either constraints or penalties, and these five methods will also be compared to the standard practice OLS estimator. We have a continuous spending outcome $Y$, a vector of binary health variables $\bm{H}=({H}_{1},\mathrm{\dots},{H}_{T})$, an input vector $\bm{X}=\{\text{female},\text{age},\bm{H}\}$, and a coefficient vector $\bm{\theta}$ indexed by $p$. For OLS, we aim to solve the following regression problem:

$$\underset{\theta}{\text{minimize}}\left\{\sum_{k}\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)^{2}\right\}. \tag{1}$$

Average Constrained Regression. A previously proposed constrained regression method for risk adjustment requires that the estimated average spending for the undercompensated group is equal to its observed average spending, which means that net compensation for the undercompensated group is zero (van Kleef et al., 2017). This is achieved by including a constraint:

$$\begin{aligned}
&\underset{\theta}{\text{minimize}}\left\{\sum_{k}\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)^{2}\right\},\\
&\text{subject to}\quad \frac{1}{n_{g}}\sum_{i\in g}Y_{i}=\frac{1}{n_{g}}\sum_{i\in g}\left(\sum_{p}\theta_{p}X_{ip}\right).
\end{aligned}$$

The above constraint has been applied in the health economics literature to reduce undercompensation for select groups (van Kleef et al., 2017, Bergquist et al., 2018).

Weighted Average Constrained Regression. The next existing method relaxes the previous constraint, allowing the estimated spending to be a weighted average of the average spending of the undercompensated group and the estimated spending under unconstrained OLS:

$$\begin{aligned}
&\underset{\theta}{\text{minimize}}\left\{\sum_{k}\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)^{2}\right\},\\
&\text{subject to}\quad \frac{1}{n_{g}}\sum_{i\in g}\left(\sum_{p}\theta_{p}X_{ip}\right)=\frac{(1-\alpha)}{n_{g}}\sum_{i\in g}Y_{i}+\frac{\alpha}{n_{g}}\sum_{i\in g}\left(\sum_{p}\theta_{p}^{OLS}X_{ip}\right),
\end{aligned}$$

where ${\bm{\theta}}^{OLS}$ is the coefficient vector from the OLS given in formula (1). The hyperparameter $\alpha \in [0,1]$ is a weighting factor. When $\alpha =0$, this method is equivalent to average constrained regression, and when $\alpha =1$ it is equivalent to OLS. Weighted average constrained regression has been shown to reduce undercompensation for select groups in the Netherlands risk adjustment formula (van Kleef et al., 2017).
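The two equality-constrained estimators admit a closed form, sketched here under the assumption that $S=\sum_{k}\bm{X}_{k}\bm{X}_{k}^{\top}$ is invertible. The symbols $S$, $\bm{a}$, ${\overline{Y}}_{g}$, and $b$ are our own shorthand and do not appear in the original text. Let $\bm{a}=\frac{1}{{n}_{g}}\sum_{i\in g}\bm{X}_{i}$ be the vector of group covariate means and ${\overline{Y}}_{g}=\frac{1}{{n}_{g}}\sum_{i\in g}{Y}_{i}$. Both constraints have the form $\bm{a}^{\top}\bm{\theta}=b$, and the standard equality-constrained least squares solution is

$$\widehat{\bm{\theta}}(b)=\bm{\theta}^{OLS}+S^{-1}\bm{a}\,\frac{b-\bm{a}^{\top}\bm{\theta}^{OLS}}{\bm{a}^{\top}S^{-1}\bm{a}}.$$

Average constrained regression uses $b={\overline{Y}}_{g}$, while the weighted version uses $b_{\alpha}=(1-\alpha){\overline{Y}}_{g}+\alpha\,\bm{a}^{\top}\bm{\theta}^{OLS}$, so that $b_{\alpha}-\bm{a}^{\top}\bm{\theta}^{OLS}=(1-\alpha)({\overline{Y}}_{g}-\bm{a}^{\top}\bm{\theta}^{OLS})$ and therefore

$$\widehat{\bm{\theta}}(b_{\alpha})=\alpha\,\bm{\theta}^{OLS}+(1-\alpha)\,\widehat{\bm{\theta}}({\overline{Y}}_{g}).$$

Under these assumptions the weighted estimator is a convex combination of the OLS and average constrained coefficients, and so are its predictions. The cross-validated values reported in Table 1 are consistent with this reading: for $\alpha=0.2$, $0.2\times(-1872)+0.8\times(-46)\approx -411$ for net compensation in $g$, and $0.2\times 256+0.8\times 6=56$ for fair covariance.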

Covariance Constrained Regression. The class of covariance methods we consider imposes a constraint on the residual by requiring that the covariance between the residual and the protected class is close to zero (Zafar et al., 2017a, b). We extend these techniques to propose new methods for our setting where we have a continuous residual, which has not been previously explored. In order to solve the optimization problem, we convert it into a convex problem. We simplify the covariance as follows:

$$\begin{aligned}
\mathrm{Cov}(A,Y-\bm{\theta}\bm{X}) &= E[(A-E[A])(Y-\bm{\theta}\bm{X}-E[Y-\bm{\theta}\bm{X}])]\\
&= E[(A-E[A])(Y-\bm{\theta}\bm{X})]\\
&\approx \frac{1}{N}\sum_{k}\left((A_{k}-P(A=1))\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)\right)\\
&\approx \frac{1}{N}\left((1-P(A=1))\sum_{i\in g}\left(Y_{i}-\sum_{p}\theta_{p}X_{ip}\right)-P(A=1)\sum_{j\in g^{c}}\left(Y_{j}-\sum_{p}\theta_{p}X_{jp}\right)\right).
\end{aligned}$$

Now that we have the covariance in a form that is linear in $\bm{\theta}$, bounding it keeps the problem convex, and we can define what we need to solve:

$$\begin{aligned}
&\underset{\theta}{\text{minimize}}\left\{\sum_{k}\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)^{2}\right\},\\
&\text{subject to}\quad (1-P(A=1))\sum_{i\in g}\left(Y_{i}-\sum_{p}\theta_{p}X_{ip}\right)-P(A=1)\sum_{j\in g^{c}}\left(Y_{j}-\sum_{p}\theta_{p}X_{jp}\right)\le c,\\
&\phantom{\text{subject to}\quad} (1-P(A=1))\sum_{i\in g}\left(Y_{i}-\sum_{p}\theta_{p}X_{ip}\right)-P(A=1)\sum_{j\in g^{c}}\left(Y_{j}-\sum_{p}\theta_{p}X_{jp}\right)\ge -c.
\end{aligned}$$

Parallel to the literature for discrete categories (Zafar et al., 2017b), we set $c=m\times {c}^{*}$, where $m$ is a multiplicative factor $m\in [0,1]$ and ${c}^{*}$ is the covariance of the undercompensated group and the OLS residual. The upper bound for $c$ occurs at $m=1$, which is ${c}^{*}$.

As we are primarily concerned with the residual of the undercompensated group being too large, we choose to instead bound the covariance on one side in our implementation of this method. In other words, we constrain the covariance to be less than some percentage of the OLS covariance (as defined by the hyperparameter $m$). A one-sided constraint also yields faster optimization. The updated optimization problem is given by:

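$$\begin{aligned}
&\underset{\theta}{\text{minimize}}\left\{\sum_{k}\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)^{2}\right\},\\
&\text{subject to}\quad (1-P(A=1))\sum_{i\in g}\left(Y_{i}-\sum_{p}\theta_{p}X_{ip}\right)-P(A=1)\sum_{j\in g^{c}}\left(Y_{j}-\sum_{p}\theta_{p}X_{jp}\right)\le c.
\end{aligned}$$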

Mean Residual Difference Penalized Regression. The relationship between penalized and constrained regressions is well recognized in statistics (Hastie et al., 2009), and one could equivalently reformulate the above constraints as penalties. Penalized regression has also been explored in the fairness literature. Calders et al. (2013) consider constrained formulations of their approaches, but propose the flexibility of penalization as an alternative due to the possibility of degenerate solutions with a high number of constraints. Their mean residual difference regression technique penalizes large mean residual differences between the undercompensated group and the complement group. The coefficients minimize:

$$\sum_{k}\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)^{2}+\lambda\left(\frac{1}{n_{g}}\sum_{i\in g}\left(Y_{i}-\sum_{p}\theta_{p}X_{ip}\right)-\frac{1}{n_{c}}\sum_{j\in g^{c}}\left(Y_{j}-\sum_{p}\theta_{p}X_{jp}\right)\right)^{2},$$

where hyperparameter $\lambda $ can be user-specified or chosen via cross-validation, and its magnitude will be on the same scale as $Y$.
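Because both terms of this objective are quadratic in $\bm{\theta}$, the minimizer can be written down directly. The following derivation is a sketch under our own shorthand, none of which appears in the original text: let ${v}_{k}=1/{n}_{g}$ for $k\in g$ and ${v}_{k}=-1/{n}_{c}$ for $k\in {g}^{c}$, let $\bm{u}=\sum_{k}{v}_{k}\bm{X}_{k}$ be the difference in group covariate means, and let $s=\sum_{k}{v}_{k}{Y}_{k}$ be the difference in group mean outcomes. The penalty is then $\lambda(s-\bm{u}^{\top}\bm{\theta})^{2}$, and setting the gradient to zero gives

$$\left(\sum_{k}\bm{X}_{k}\bm{X}_{k}^{\top}+\lambda\,\bm{u}\bm{u}^{\top}\right)\bm{\theta}=\sum_{k}\bm{X}_{k}Y_{k}+\lambda\,s\,\bm{u}.$$

At $\lambda=0$ these are the OLS normal equations. As $\lambda$ grows, the rank-one term increasingly forces $\bm{u}^{\top}\bm{\theta}$ toward $s$, that is, toward a mean residual difference of zero.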

Net Compensation Penalized Regression. In our second new method, rather than imposing a constraint, we also formulate a penalized regression. Our regression involves the inclusion of a custom penalty term in the minimization problem:

$$\sum_{k}\left(Y_{k}-\sum_{p}\theta_{p}X_{kp}\right)^{2}+\lambda\left(\frac{1}{n_{g}}\sum_{i\in g}\left(Y_{i}-\sum_{p}\theta_{p}X_{ip}\right)\right).$$

This penalty punishes estimators where the net compensation, or difference between the average spending and predicted spending for the undercompensated group, is large. We can alternatively present our new method as a constraint:

$$\text{subject to}\quad \frac{1}{n_{g}}\sum_{i\in g}\left(Y_{i}-\sum_{p}\theta_{p}X_{ip}\right)\le z,$$

where the hyperparameter $z$ is positive and has a one-to-one correspondence with, but is not equal to, $\lambda $ when the constraint is binding. We choose to primarily implement this method as a penalized regression to explore differences in performance with the mean residual difference penalized regression for the same values of $\lambda $. Simulation studies in Appendix A examine the performance of the constrained formulation.
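Since the net compensation penalty is linear in $\bm{\theta}$, the first-order conditions of the penalized problem reduce to shifted normal equations, $\sum_{k}\bm{X}_{k}\bm{X}_{k}^{\top}\bm{\theta}=\sum_{k}\bm{X}_{k}{Y}_{k}+\frac{\lambda}{2}{\overline{\bm{X}}}_{g}$, where ${\overline{\bm{X}}}_{g}$ is the vector of covariate means in $g$. The sketch below solves that system in plain Python on a made-up four-observation dataset. It is an illustration under these assumptions, not the CVXR implementation used in the paper; the helper names `solve`, `net_compensation_penalized`, and `net_compensation` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]


def net_compensation_penalized(x, y, in_group, lam):
    """Minimize sum_k (y_k - x_k.theta)^2 + lam * mean over g of (y_i - x_i.theta)."""
    n, n_cols = len(y), len(x[0])
    n_g = sum(in_group)
    xtx = [[sum(x[k][a] * x[k][b] for k in range(n)) for b in range(n_cols)]
           for a in range(n_cols)]
    xty = [sum(x[k][a] * y[k] for k in range(n)) for a in range(n_cols)]
    # Covariate means in g: the gradient of the linear penalty is -lam * x_bar_g.
    x_bar_g = [sum(x[k][a] for k in range(n) if in_group[k]) / n_g for a in range(n_cols)]
    rhs = [xty[a] + 0.5 * lam * x_bar_g[a] for a in range(n_cols)]
    return solve(xtx, rhs)


def net_compensation(x, y, in_group, theta):
    """Mean of (prediction - outcome) over the group; negative means undercompensation."""
    gaps = [sum(t * v for t, v in zip(theta, x[k])) - y[k]
            for k in range(len(y)) if in_group[k]]
    return sum(gaps) / len(gaps)


# Toy data (made up): intercept plus one covariate; observations 1 and 3 are in g.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 2.0, 6.0]
in_group = [False, True, False, True]

nc_by_lambda = {lam: net_compensation(x, y, in_group,
                                      net_compensation_penalized(x, y, in_group, lam))
                for lam in (0.0, 4.0, 8.0)}
```

In this toy example the group's net compensation moves linearly in $\lambda$, from undercompensation at $\lambda=0$ to overcompensation at the largest value. This behavior follows from the linear penalty and is in line with the overcompensation at large $\lambda$ described in Section 3.2, which is why the hyperparameter needs tuning rather than simply being set large.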

### 2.3 Computational Implementation

Each of the six methods was evaluated to assess both statistical fit and fairness goals with 5-fold cross-validation using the suite of five measures defined in Section 2.1. OLS was implemented in the R programming language with the lm() function. All other estimators were optimized using the CVXR package. This package uses disciplined convex programming to solve optimization problems and allows users to specify novel constraints (Fu et al., 2018).
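The cross-validated measures rely on each enrollee being predicted by a model that was fit without that enrollee. A minimal sketch of this bookkeeping in plain Python follows; the function `out_of_fold_predictions`, the fold assignment, and the one-covariate least squares stand-in are illustrative assumptions rather than the authors' R code.

```python
import random


def out_of_fold_predictions(x, y, fit, predict, n_folds=5, seed=0):
    """Return one held-out prediction per observation using n_folds-fold cross-validation."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    folds = [order[f::n_folds] for f in range(n_folds)]  # disjoint, near-equal folds
    y_hat = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        train = [k for k in order if k not in held]
        model = fit([x[k] for k in train], [y[k] for k in train])
        for k in held_out:
            y_hat[k] = predict(model, x[k])
    return y_hat


# Stand-in estimator for illustration: least squares with a single covariate.
def fit_simple_ols(x, y):
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


def predict_simple_ols(model, x_k):
    intercept, slope = model
    return intercept + slope * x_k


# Toy data (made up): an exactly linear relationship, so held-out predictions match y.
x = [float(k) for k in range(20)]
y = [2.0 + 3.0 * v for v in x]
y_hat = out_of_fold_predictions(x, y, fit_simple_ols, predict_simple_ols)
```

The fairness and fit measures would then be computed once on the pooled held-out predictions, with any of the constrained or penalized estimators passed in place of the stand-in.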

## 3 Health Care Spending Application

Our data application features the IBM MarketScan Research Databases. This set of databases contains enrollee-level claims, demographic information, and health plan spending for a sample of individuals (and their dependents) insured by private health plans and large employers across the country. In 2014, the IBM MarketScan Research Databases were used by the federal government to develop the risk adjustment formula for the individual health insurance Marketplaces. Thus, this data source is particularly policy relevant. Although the data are representative of only a subset of the U.S. health insurance market, our methods are appropriate for other markets and different application settings with continuous outcomes.

We selected a random sample of 100,000 enrollees from 2015-2016. Variables for 2015 included age, sex, and diagnosed health conditions with total annual expenditures from the year 2016. Diagnosed health conditions took the form of the established Hierarchical Condition Category (HCC) variables used for risk adjustment. HCCs were developed by the Department of Health and Human Services to group a selection of International Classification of Diseases and Related Health Problems (ICD) codes into indicators for various health conditions (Pope et al., 2004, Kautter et al., 2014). We considered the 79 HCC variables currently used in Medicare Advantage risk adjustment formulas and retained the 62 HCCs that had at least 30 enrollees with the condition. Our sample of enrollees was 52% female and between the ages of 21 and 63, with median age 45. Mean and median annual expenditures per enrollee were $6,651 and $1,511, respectively.

### 3.1 Defining the Undercompensated Group

The undercompensated group we focused on for this data application was enrollees with MHSUD. We selected this group for two major reasons. First, individuals with MHSUD are known to have substantially undercompensated payments in current risk adjustment formulas (Montz et al., 2016). Second, about 20% of people in the United States have MHSUD, so it is a priority area for policy change. We defined enrollees with MHSUD using Clinical Classification Software (CCS) categories. This classification system maps each MHSUD-related ICD code to a CCS category, unlike the HCCs, which only map a subset of MHSUD-related ICD codes. Based on CCS categories, 13.8% of the sample had a diagnosis code for MHSUD compared to 2.6% had we used HCCs. We note that we do not capture enrollees with MHSUD who do not have an ICD code for their condition(s). The mean annual expenditures for MHSUD enrollees in our sample were $11,520 versus $5,880 for enrollees without MHSUD (and $3,744 versus $1,274 for median annual expenditures).

### 3.2 Results

We compared each method to determine which estimators were best at reducing undercompensation for enrollees with MHSUD, and at what cost to overall statistical fit. In Table 1, we report the top estimators with respect to fairness for each of the six methods, having selected the hyperparameter value that optimizes the fairness measures (for those that have these parameters). Comparisons of global fit versus group fairness for the three methods with variation in performance by hyperparameter can be found in Figure 1.

OLS had a cross-validated ${R}^{2}$ of 12.9% and a predictive ratio of 0.837 for individuals with MHSUD. It underestimated average MHSUD spending by $1,872 (a net compensation of -$1,872), with a mean residual difference of -$2,165. The fair covariance measure was 256. Average spending for enrollees without MHSUD was overestimated by $293 with a predictive ratio of 1.050. OLS had the worst performance on all fairness metrics while producing an ${R}^{2}$ only trivially higher than the competing methods.

We found the best improvement in fairness for MHSUD using the existing average constrained regression and our new covariance constrained regression. These two methods had similar, although not identical, performance and reduced the average undercompensation for enrollees with MHSUD to -$46 (versus -$1,872 in the OLS), a relative improvement of 98%. They also increased the predictive ratio from 0.837 to 0.996. Spending for enrollees without MHSUD was overestimated by only $4, with a predictive ratio of 1.001. Both methods reduced the fair covariance measure from 256 to 6. Unsurprisingly, these two estimators were also the worst performers on overall fit as measured by ${R}^{2}$, although it was a relative loss of only 4%, from 12.9% to 12.4%. This small 0.5 percentage point loss in ${R}^{2}$ may be tolerable to policymakers.

| Method | ${R}^{2}$ (%) | Predictive Ratio, $g$ | Predictive Ratio, ${g}^{c}$ | Net Compensation, $g$ (dollars) | Net Compensation, ${g}^{c}$ (dollars) | Mean Residual Difference (dollars) | Fair Covariance |
|---|---|---|---|---|---|---|---|
| Average | 12.4 | 0.996 | 1.001 | -46 | 4 | -50 | 6 |
| Covariance | 12.4 | 0.996 | 1.001 | -46 | 4 | -50 | 6 |
| Net Compensation${}^{\dagger}$ | 12.5 | 0.980 | 1.006 | -232 | 34 | -266 | 31 |
| Weighted Average${}^{\mp}$ | 12.6 | 0.964 | 1.011 | -411 | 62 | -473 | 56 |
| Mean Residual Difference${}^{\oplus}$ | 12.8 | 0.895 | 1.032 | -1208 | 188 | -1396 | 164 |
| OLS | 12.9 | 0.837 | 1.050 | -1872 | 293 | -2165 | 256 |

${}^{\dagger}\lambda =10000$, ${}^{\mp}\alpha =0.2$, ${}^{\oplus}\lambda =30000$

Note: Measures calculated based on cross-validated predicted values and sorted on net compensation. Best performing hyperparameters for each estimator (with respect to fairness measures) are displayed. Performance for covariance method was the same for all $m$. ${g}^{c}$ is the complement of $g$.

Recall that the weighted average constrained regression is a compromise estimator between the OLS and average constrained regression. As $\alpha $ approached one in the first panel of Figure 1, the metrics more closely resembled the OLS results. As $\alpha $ approached zero we saw values closer to the average constrained regression results, although $\alpha =0.2$ was not only dominated by the average constrained and covariance constrained regressions, but also the net compensation penalized regression with $\lambda =10000$.

The remaining two methods were regressions with customized penalty terms to punish unfair estimates. Our proposed net compensation penalized regression varied substantially by hyperparameter (see second panel in Figure 1), although it was the third best performer overall at its optimal hyperparameter value. Large $\lambda $ values yielded extremely poor performance on both overall fit and fairness. At $\lambda =20000$, ${R}^{2}$ dropped by 12% to 11.9%, and when $\lambda $ increased to $30000$, ${R}^{2}$ dropped to 9%, a relative reduction of 29%. These two $\lambda $ values led to a large overcompensation for enrollees with MHSUD. The covariance was also negative, indicating that the residual value for MHSUD was systematically too high. The mean residual difference penalized regression was less sensitive to hyperparameters than the net compensation penalized regression (see third panel in Figure 1). The best performance for mean residual difference penalized regression was at $\lambda =30000$; it improved on the MHSUD predictive ratio for OLS by 7% (from 0.837 to 0.895) with an ${R}^{2}$ loss of less than 1%. However, the best performing net compensation penalized regression had an 81% improvement over the best performing mean residual difference penalized regression when comparing MHSUD net compensation, as well as large improvements in predictive ratios (0.980 versus 0.895) and fair covariance (31 versus 164).

We also examined the HCC variable coefficients for the best performing estimators, the average constrained and covariance constrained regressions, in comparison to OLS. Risk adjustment coefficients communicate incentives to insurers and providers related to prevention and care. For example, coefficients that do not reflect costs can impact an insurer’s incentives in creating their plan offerings. Coefficients for the average constrained and covariance constrained regressions were nearly identical when rounded to the nearest whole dollar, thus we display OLS versus covariance constrained regression in Figure 2. We considered the largest five increases and largest five decreases from OLS to covariance constrained regression, and observed sizable increases in the estimated coefficients associated with MHSUD. The largest relative increase was 180% for “Schizophrenia.” Relative decreases were much smaller.

## 4 Simulation Study

A set of simulation scenarios was developed to explore how these regression methods perform in other settings. We generated a population of 100,000 observations with two continuous outcomes ${Y}_{1}$ and ${Y}_{2}$ that were each a function of covariates in $\bm{X}=({X}_{1},{X}_{2},\mathrm{\dots},{X}_{9})$ and two distinct yet partially overlapping protected classes (${A}_{1}$ and ${A}_{2}$) that depended on variables in $\bm{X}$. Scenario 1 considered a complex functional form for ${Y}_{1}$ and regression estimators that were misspecified, including omitted $\bm{X}$ variables. Scenario 2 examined a less complex functional form in ${Y}_{2}$ and regression estimators that were misspecified, including additional noise variables but no omitted $\bm{X}$ variables. A third scenario is discussed in Appendix A, along with complete details for the simulated population and first two scenarios. For each scenario, we drew 500 samples of $N=1,000$ and $N=10,000$ observations from the simulated population.

Selected results are presented in Figure 3, which includes OLS and those methods that improved fairness measures for protected class ${A}_{1}$ with a relative ${R}^{2}$ loss $\le 10$%. Notably, average constrained and covariance constrained regression, the tied top estimators in our data analysis, do not appear. This was common across settings; average constrained and covariance constrained regression often struggled with functional form misspecification. However, net compensation penalized regression, which performed well in our data analysis, also performed well in the simulations with respect to achieving metric balance between global fit decreases and group fit increases. Full results are available in Appendix A.
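The selection rule used for Figure 3 can be illustrated with a short sketch. It applies the rule to the Scenario 2 rows of the second results table in Appendix A. Reading "improved fairness" as a smaller absolute net compensation for ${g}_{1}$ than under OLS is an assumption made here for illustration, as are all function and variable names.

```python
# Scenario 2 rows copied from the second results table in Appendix A:
# (method, R^2 in %, predictive ratio g1, net compensation g1)
ROWS = [
    ("Average", -31.2, 1.00, 0),
    ("Covariance, m=0.2", -31.2, 1.00, 0),
    ("Net Compensation Constraint, lambda=0.2", -12.2, 0.97, -5),
    ("Weighted Average, alpha=0.2", 0.4, 0.94, -9),
    ("Mean Residual Difference, lambda=5000", 12.3, 0.91, -12),
    ("Net Compensation Constraint, z=0.6", 18.8, 0.90, -15),
    ("Net Compensation, lambda=5000", 22.6, 0.89, -16),
    ("Weighted Average, alpha=0.4", 25.0, 0.88, -17),
    ("Net Compensation Constraint, z=1", 40.6, 0.83, -24),
    ("Weighted Average, alpha=0.6", 42.6, 0.82, -26),
    ("Mean Residual Difference, lambda=1000", 47.1, 0.80, -29),
    ("Weighted Average, alpha=0.8", 53.1, 0.76, -34),
    ("Net Compensation, lambda=1000", 55.3, 0.74, -37),
    ("Mean Residual Difference, lambda=100", 56.4, 0.72, -41),
    ("Net Compensation, lambda=100", 56.6, 0.71, -42),
    ("OLS", 56.6, 0.70, -43),
]

MAX_RELATIVE_R2_LOSS = 0.10  # the "relative R^2 loss <= 10%" threshold


def relative_r2_loss(r2, r2_ols):
    """Share of the OLS R^2 that an estimator gives up."""
    return (r2_ols - r2) / r2_ols


_, r2_ols, _, nc_ols = next(row for row in ROWS if row[0] == "OLS")

# ASSUMPTION: "improved fairness" means |net compensation| is closer to
# zero than under OLS; the paper may use a broader set of measures.
selected = [
    name
    for name, r2, _, nc in ROWS
    if name != "OLS"
    and relative_r2_loss(r2, r2_ols) <= MAX_RELATIVE_R2_LOSS
    and abs(nc) < abs(nc_ols)
]

for name in selected:
    print(name)
```

Under these assumptions only four estimators pass the filter, which shows how quickly the 10% threshold excludes methods that achieve near-zero net compensation at a large cost in overall fit.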

## 5 Discussion

We proposed new fair regression methods aiming to improve risk adjustment for undercompensated groups and asserted that a broader set of metrics is needed. As expected, no single method performed best across all the measures. One of our newly proposed techniques, net compensation penalized regression, had strong performance with respect to fairness and global fit in both the data analysis and simulations. Selecting the ‘best’ method relies on subjective decisions regarding how to balance group fairness versus overall fit tradeoffs. Improvements in fairness came with decreases in ${R}^{2}$. However, for many estimators, particularly in our data analysis, improvements in fairness were larger than the accompanying decreases in overall fit. This suggests that if we allow for a slight drop in overall fit, we could greatly increase compensation for MHSUD. Policymakers need to consider whether they are willing to accept small reductions in global fit in exchange for large improvements in fairness.

We used a sample of enrollees in our demonstration. At scale in a policy implementation, data from millions of enrollees would be used to estimate health spending. Solutions to group undercompensation must be scalable, and current software may or may not yet be capable of handling the sample sizes required. We tested the CVXR optimization package on larger samples and found that it was able to find solutions on a sample of 1,000,000 observations over the span of 3 days (versus 7 hours for the 100,000 enrollee sample). While the optimization results were not within the ideal optimal threshold, they still converged and the results were similar to those presented in this paper, which is promising. Future work includes additional studies regarding scalability. In our analyses, we also preselected hyperparameter values. A more thorough approach, with possibly improved results, would explore the hyperparameter space in an automated way to select values that optimize over joint fairness and fit objectives. As a general guideline, we found that $\lambda =N/10$ yielded reasonable metric balance for our newly proposed net compensation penalized regression.

We focused on one group that risk adjustment is known to disadvantage, but it is important to extend such strategies to multiple groups. Improvements for one group could result in subsequent undercompensation for other groups, and balancing fairness across an increasing number of groups will be a continuous challenge in risk adjustment. Our simulations examined two protected classes, and we found that improving fairness for one group did not generally help or harm the second group, but this will not always be the case. Even the act of defining the groups poses a problem, as this can be subjective, potentially favoring larger groups with well-funded advocacy organizations. Undercompensation could be found in many other lesser-known groups. However, we can only measure undercompensation for groups that are identified by available data, and socioeconomic information, such as poverty and housing, is not available at the individual level for risk adjustment (Ellis et al., 2018).

Broadly, data-driven decisions have come under scrutiny for perpetuating human biases and disparities, which certainly exist in risk adjustment. Arguments for a more comprehensive view of research results are increasing among scientific researchers today (O’Neil, 2017; Gibney, 2018). Recent work argues that evaluating methods from a purely statistical standpoint can lead to negative consequences, and that policy aims should be better incorporated into our research (Corbett-Davies and Goel, 2018). Our article follows in this spirit, and we presented additional estimators and comparisons across multiple measures for the numerous (sometimes competing) goals of risk adjustment. While we worked within the specific context of risk adjustment, the fairness methods and measures discussed here have implications for other settings with continuous outcomes, which have been understudied relative to binary outcomes.

## Data and Code

The IBM MarketScan Research Databases used in Section 3 are not available for public dissemination as they contain protected patient information and we were granted access via a restricted data use agreement. Instead, we provide simulated analysis data that preserves important relationships of the original data while protecting the original content. We also provide simulation data and code to reproduce the simulation study from Section 4. All of these materials are available online: https://github.com/zinka88/Fair-Regression.

## Appendix A: Simulation Study Details

As described in Section 4 of the main text, our simulation study population of 100,000 observations considered covariates $\bm{X}=({X}_{1},{X}_{2},\mathrm{\dots},{X}_{9})$, two protected class indicator variables (${A}_{1}$ and ${A}_{2}$), and two continuous outcome variables (${Y}_{1}$ and ${Y}_{2}$). ${X}_{1}$ was generated from a Normal distribution with mean 70 and standard deviation 15. Both ${X}_{2}$ and ${X}_{3}$ had Poisson distributions, with $\lambda $ values of 10 and 35, respectively. The last six covariates (${\bm{X}}_{4:9}$) were drawn from Bernoulli distributions with probabilities 0.5, 0.1, 0.05, 0.8, 0.03, and 0.2. ${A}_{1}$ and ${A}_{2}$ were also drawn from Bernoulli distributions, but depended on other generated variables in the population:

$$\begin{aligned}{A}_{1}&\sim \text{Bernoulli}({X}_{4}\times {X}_{9}/2+.01)\\ {A}_{2}&\sim \text{Bernoulli}({X}_{4}^{2}/3+.05).\end{aligned}$$

They had prevalence rates of $6\%$ and $22\%$, respectively, with 2.1% overlap. Both outcomes, ${Y}_{1}$ and ${Y}_{2}$, depended on variables in $\bm{X}$ as well as ${A}_{1}$ and ${A}_{2}$:

$$\begin{aligned}{Y}_{1}&=({X}_{1}\times {X}_{2}\times {X}_{4})+({A}_{1}\times {X}_{2}\times {X}_{7})+({X}_{3}\times {X}_{5}\times {X}_{6})+{2}^{({X}_{8}\times {X}_{9})}\\ &\quad +({A}_{1}\times {X}_{1}\times {X}_{5})+({A}_{2}\times {X}_{3}\times {X}_{5})\\ {Y}_{2}&={X}_{1}+{X}_{2}+({X}_{3}\times {X}_{4}\times {X}_{5})+({A}_{1}\times {X}_{3})+({A}_{1}\times {A}_{2}\times {X}_{1}).\end{aligned}$$
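A minimal sketch of this data-generating process follows, using only the Python standard library. It is not the authors' code (which is in the linked repository); the helper names, the seed, and the choice of Poisson sampler are assumptions for illustration. Outcomes are computed exactly as the equations are written, with no additional noise term.

```python
import math
import random

# Bernoulli probabilities for X4, ..., X9 as listed in the text.
BERNOULLI_P = [0.5, 0.1, 0.05, 0.8, 0.03, 0.2]


def poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def bernoulli(rng, p):
    return 1 if rng.random() < p else 0


def simulate_row(rng):
    x = [None] * 10  # x[1]..x[9]; index 0 is unused so indices match the text
    x[1] = rng.gauss(70, 15)
    x[2] = poisson(rng, 10)
    x[3] = poisson(rng, 35)
    for j, p in zip(range(4, 10), BERNOULLI_P):
        x[j] = bernoulli(rng, p)
    # Protected classes depend on X4 and X9.
    a1 = bernoulli(rng, x[4] * x[9] / 2 + 0.01)
    a2 = bernoulli(rng, x[4] ** 2 / 3 + 0.05)
    y1 = (x[1] * x[2] * x[4] + a1 * x[2] * x[7] + x[3] * x[5] * x[6]
          + 2 ** (x[8] * x[9]) + a1 * x[1] * x[5] + a2 * x[3] * x[5])
    y2 = x[1] + x[2] + x[3] * x[4] * x[5] + a1 * x[3] + a1 * a2 * x[1]
    return {"x": x[1:], "a1": a1, "a2": a2, "y1": y1, "y2": y2}


def simulate_population(n, seed):
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    return [simulate_row(rng) for _ in range(n)]


population = simulate_population(100_000, seed=2019)
n = len(population)
prev_a1 = sum(r["a1"] for r in population) / n
prev_a2 = sum(r["a2"] for r in population) / n
overlap = sum(r["a1"] * r["a2"] for r in population) / n
print(f"A1: {prev_a1:.3f}, A2: {prev_a2:.3f}, both: {overlap:.3f}")

# One draw of N = 1,000 from the population, as in the sampling design.
sample = random.Random(1).sample(population, 1_000)
```

Working through the Bernoulli probabilities by hand gives expected prevalences of 6% for ${A}_{1}$ and about 22% for ${A}_{2}$, with roughly 2.1% overlap, which agrees with the rates reported in this appendix.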

We estimated regressions in three scenarios representing differing types of functional form misspecification:

$$\begin{aligned}\text{Scenario 1: }{Y}_{1}&={\beta}_{1}{X}_{1}+{\beta}_{2}{X}_{2}+{\beta}_{3}{X}_{3}+{\beta}_{4}{X}_{5}+{\beta}_{5}{X}_{6}+{\beta}_{6}{X}_{7}+{\beta}_{7}{X}_{8}+{\beta}_{8}{X}_{9}\\ \text{Scenario 2: }{Y}_{2}&={\gamma}_{1}{X}_{1}+{\gamma}_{2}{X}_{2}+{\gamma}_{3}{X}_{3}+{\gamma}_{4}{X}_{4}+{\gamma}_{5}{X}_{5}+{\gamma}_{6}{X}_{6}+{\gamma}_{7}{X}_{7}+{\gamma}_{8}{X}_{8}+{\gamma}_{9}{X}_{9}\\ \text{Scenario 3: }{Y}_{2}&={\zeta}_{1}{X}_{1}+{\zeta}_{2}{X}_{4}+{\zeta}_{3}{X}_{6}+{\zeta}_{4}{X}_{7}+{\zeta}_{5}{X}_{8}+{\zeta}_{6}{X}_{9}.\end{aligned}$$
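The type of misspecification in each scenario can be read off by comparing the regressors above with the covariates that enter the true outcome equations. The sketch below does this comparison with set differences; the names `TRUE_INPUTS`, `SCENARIOS`, and `misspecification` are hypothetical, and only the $\bm{X}$ covariates are compared (the protected class indicators and the interaction terms are left out of every scenario's regression).

```python
# Covariates in X that enter each true outcome equation.
TRUE_INPUTS = {
    "Y1": {"X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9"},
    "Y2": {"X1", "X2", "X3", "X4", "X5"},
}

# Outcome and main-effect regressors used in each estimated scenario.
SCENARIOS = {
    1: ("Y1", ["X1", "X2", "X3", "X5", "X6", "X7", "X8", "X9"]),
    2: ("Y2", [f"X{j}" for j in range(1, 10)]),
    3: ("Y2", ["X1", "X4", "X6", "X7", "X8", "X9"]),
}


def misspecification(scenario):
    """Return (omitted, noise): true inputs left out, and regressors
    that do not appear in the true outcome equation."""
    outcome, regressors = SCENARIOS[scenario]
    truth = TRUE_INPUTS[outcome]
    omitted = sorted(truth - set(regressors))
    noise = sorted(set(regressors) - truth)
    return omitted, noise


for s in SCENARIOS:
    omitted, noise = misspecification(s)
    print(f"Scenario {s}: omitted={omitted}, noise={noise}")
# Scenario 1: omitted=['X4'], noise=[]
# Scenario 2: omitted=[], noise=['X6', 'X7', 'X8', 'X9']
# Scenario 3: omitted=['X2', 'X3', 'X5'], noise=['X6', 'X7', 'X8', 'X9']
```

Scenario 1 therefore omits ${X}_{4}$, the variable that drives both protected classes; Scenario 2 adds noise variables without omitting any covariate; and Scenario 3 combines omitted and noise variables.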

Complete results for 500 draws from the population with $N=1,000$ and $N=10,000$ are given in Tables A1 and A2. Simulation data and complete analytic code to reproduce the simulation analyses are available online: https://github.com/zinka88/Fair-Regression.

Scenario | Method | ${R}^{2}$ | Predictive Ratio ${g}_{1}$ | Net Compensation ${g}_{1}$ | Net Compensation ${g}_{2}$ | Fair Covariance |
---|---|---|---|---|---|---|
1 | *Net Compensation, $\lambda =5000$* | -2901.0 | 5.96 | 3231 | -273 | -198.6 |
1 | *Net Compensation, $\lambda =1000$* | -106.9 | 1.63 | 408 | -270 | -25.0 |
1 | *Average* | -8.8 | 0.98 | -15 | -270 | 0.9 |
1 | *Covariance, $m=0.2$* | -8.8 | 0.98 | -15 | -270 | 0.9 |
1 | *Mean Residual Difference, $\lambda =5000$* | -7.0 | 0.96 | -28 | -270 | 1.7 |
1 | *Mean Residual Difference, $\lambda =1000$* | -2.0 | 0.89 | -70 | -270 | 4.3 |
1 | *Weighted Average, $\alpha =0.2$* | -1.9 | 0.89 | -71 | -270 | 4.4 |
1 | Net Compensation Constraint, $z=0.2$ | 0.1 | 0.86 | -92 | -270 | 5.7 |
1 | Weighted Average, $\alpha =0.4$ | 3.5 | 0.80 | -128 | -269 | 7.9 |
1 | Weighted Average, $\alpha =0.6$ | 7.3 | 0.72 | -185 | -269 | 11.4 |
1 | Mean Residual Difference, $\lambda =100$ | 8.7 | 0.67 | -215 | -269 | 13.2 |
1 | Net Compensation, $\lambda =100$ | 9.1 | 0.65 | -227 | -269 | 14.0 |
1 | Weighted Average, $\alpha =0.8$ | 9.6 | 0.63 | -241 | -269 | 14.9 |
1 | Net Compensation Constraint, $\alpha =0.6$ | 9.6 | 0.62 | -245 | -269 | 15.1 |
1 | Net Compensation Constraint, $z=1$ | 10.4 | 0.54 | -298 | -269 | 18.3 |
1 | OLS | 10.4 | 0.54 | -298 | -269 | 18.4 |
2 | *Net Compensation, $\lambda =5000$* | -3436.9 | 2.53 | 217 | 43 | -13.3 |
2 | *Net Compensation, $\lambda =1000$* | -85.7 | 1.07 | 9 | 5 | -0.6 |
2 | *Average* | -33.5 | 0.99 | -1 | 3 | 0.1 |
2 | *Covariance, $m=0.2$* | -33.4 | 0.99 | -1 | 3 | 0.1 |
2 | *Mean Residual Difference, $\lambda =5000$* | -26.5 | 0.98 | -3 | 3 | 0.2 |
2 | *Net Compensation Constraint, $z=0.2$* | -14.3 | 0.96 | -6 | 3 | 0.4 |
2 | *Mean Residual Difference, $\lambda =1000$* | -5.6 | 0.94 | -8 | 2 | 0.5 |
2 | *Weighted Average, $\alpha =0.2$* | -1.5 | 0.93 | -10 | 2 | 0.6 |
2 | Net Compensation Constraint, $\alpha =0.6$ | 17.2 | 0.89 | -16 | 1 | 1.0 |
2 | Weighted Average, $\alpha =0.4$ | 23.5 | 0.87 | -18 | 0 | 1.1 |
2 | Net Compensation Constraint, $z=1$ | 39.3 | 0.82 | -25 | -1 | 1.5 |
2 | Weighted Average, $\alpha =0.6$ | 41.3 | 0.82 | -26 | -1 | 1.6 |
2 | Mean Residual Difference, $\lambda =100$ | 45.9 | 0.80 | -29 | -2 | 1.8 |
2 | Weighted Average, $\alpha =0.8$ | 52.2 | 0.76 | -34 | -3 | 2.1 |
2 | Net Compensation, $\lambda =100$ | 54.4 | 0.74 | -37 | -3 | 2.3 |
2 | OLS | 56.0 | 0.70 | -43 | -4 | 2.6 |
3 | *Net Compensation, $\lambda =5000$* | -726.2 | 1.00 | 1 | 44 | 0.0 |
3 | *Average* | -582.7 | 0.97 | -5 | 39 | 0.3 |
3 | *Covariance, $m=0.2$* | -582.4 | 0.97 | -5 | 39 | 0.3 |
3 | *Net Compensation Constraint, $z=0.2$* | -472.3 | 0.94 | -9 | 35 | 0.6 |
3 | *Mean Residual Difference, $\lambda =5000$* | -395.8 | 0.91 | -12 | 32 | 0.8 |
3 | *Weighted Average, $\alpha =0.2$* | -358.0 | 0.90 | -14 | 31 | 0.9 |
3 | *Net Compensation Constraint, $\alpha =0.6$* | -283.7 | 0.87 | -18 | 27 | 1.1 |
3 | *Weighted Average, $\alpha =0.4$* | -183.0 | 0.83 | -24 | 22 | 1.5 |
3 | *Net Compensation Constraint, $z=1$* | -138.0 | 0.81 | -27 | 19 | 1.6 |
3 | *Mean Residual Difference, $\lambda =100$* | -121.7 | 0.80 | -28 | 18 | 1.7 |
3 | *Weighted Average, $\alpha =0.6$* | -57.7 | 0.77 | -33 | 13 | 2.0 |
3 | Net Compensation, $\lambda =1000$ | 12.0 | 0.71 | -42 | 6 | 2.6 |
3 | Weighted Average, $\alpha =0.8$ | 18.0 | 0.70 | -43 | 5 | 2.6 |
3 | Mean Residual Difference, $\lambda =100$ | 37.5 | 0.66 | -48 | 0 | 2.9 |
3 | Net Compensation, $\lambda =100$ | 43.5 | 0.64 | -51 | -3 | 3.1 |
3 | OLS | 44.0 | 0.63 | -52 | -4 | 3.2 |

Note: Measures calculated based on cross-validated predicted values and sorted on net compensation. Estimators with negative ${R}^{2}$ values are in italic text.

Scenario | Method | ${R}^{2}$ | Predictive Ratio ${g}_{1}$ | Net Compensation ${g}_{1}$ | Net Compensation ${g}_{2}$ | Fair Covariance |
---|---|---|---|---|---|---|
1 | *Net Compensation, $\lambda =5000$* | -15.9 | 1.08 | 51 | -271 | -3.1 |
1 | *Average* | -8.4 | 1.00 | -1 | -271 | 0.1 |
1 | *Covariance, $m=0.2$* | -8.4 | 1.00 | -1 | -271 | 0.1 |
1 | *Weighted Average, $\alpha =0.2$* | -1.3 | 0.90 | -60 | -271 | 3.7 |
1 | Net Compensation Constraint, $z=0.2$ | 0.8 | 0.88 | -81 | -271 | 5.0 |
1 | Mean Residual Difference, $\lambda =5000$ | 2.6 | 0.85 | -101 | -271 | 6.2 |
1 | Weighted Average, $\alpha =0.4$ | 4.2 | 0.82 | -120 | -271 | 7.3 |
1 | Weighted Average, $\alpha =0.6$ | 8.2 | 0.73 | -179 | -271 | 10.9 |
1 | Mean Residual Difference, $\lambda =1000$ | 9.8 | 0.67 | -213 | -271 | 13.1 |
1 | Net Compensation, $\lambda =1000$ | 10.2 | 0.65 | -227 | -271 | 13.9 |
1 | Weighted Average, $\alpha =0.8$ | 10.5 | 0.64 | -238 | -271 | 14.6 |
1 | Net Compensation Constraint, $z=0.6$ | 10.6 | 0.63 | -241 | -271 | 14.8 |
1 | Mean Residual Difference, $\lambda =100$ | 11.3 | 0.56 | -286 | -271 | 17.5 |
1 | Net Compensation, $\lambda =100$ | 11.3 | 0.56 | -290 | -271 | 17.7 |
1 | Net Compensation Constraint, $z=1$ | 11.3 | 0.55 | -297 | -271 | 18.2 |
1 | OLS | 11.3 | 0.55 | -297 | -271 | 18.2 |
2 | *Average* | -31.2 | 1.00 | 0 | 3 | 0.0 |
2 | *Covariance, $m=0.2$* | -31.2 | 1.00 | 0 | 3 | 0.0 |
2 | *Net Compensation Constraint, $\lambda =0.2$* | -12.2 | 0.97 | -5 | 3 | 0.3 |
2 | Weighted Average, $\alpha =0.2$ | 0.4 | 0.94 | -9 | 2 | 0.5 |
2 | Mean Residual Difference, $\lambda =5000$ | 12.3 | 0.91 | -12 | 1 | 0.8 |
2 | Net Compensation Constraint, $z=0.6$ | 18.8 | 0.90 | -15 | 1 | 0.9 |
2 | Net Compensation, $\lambda =5000$ | 22.6 | 0.89 | -16 | 1 | 1.0 |
2 | Weighted Average, $\alpha =0.4$ | 25.0 | 0.88 | -17 | 0 | 1.1 |
2 | Net Compensation Constraint, $z=1$ | 40.6 | 0.83 | -24 | -1 | 1.5 |
2 | Weighted Average, $\alpha =0.6$ | 42.6 | 0.82 | -26 | -1 | 1.6 |
2 | Mean Residual Difference, $\lambda =1000$ | 47.1 | 0.80 | -29 | -2 | 1.8 |
2 | Weighted Average, $\alpha =0.8$ | 53.1 | 0.76 | -34 | -3 | 2.1 |
2 | Net Compensation, $\lambda =1000$ | 55.3 | 0.74 | -37 | -3 | 2.3 |
2 | Mean Residual Difference, $\lambda =100$ | 56.4 | 0.72 | -41 | -4 | 2.5 |
2 | Net Compensation, $\lambda =100$ | 56.6 | 0.71 | -42 | -4 | 2.6 |
2 | OLS | 56.6 | 0.70 | -43 | -4 | 2.6 |
3 | *Average* | -637.6 | 1.00 | -1 | 44 | 0.0 |
3 | *Covariance, $m=0.2$* | -637.5 | 1.00 | -1 | 44 | 0.0 |
3 | *Net Compensation Constraint, $z=0.2$* | -517.1 | 0.96 | -5 | 40 | 0.3 |
3 | *Weighted Average, $\alpha =0.2$* | -392.3 | 0.92 | -11 | 34 | 0.7 |
3 | *Net Compensation Constraint, $z=0.6$* | -311.2 | 0.89 | -15 | 31 | 0.9 |
3 | *Weighted Average, $\alpha =0.4$* | -201.5 | 0.85 | -21 | 25 | 1.3 |
3 | *Net Compensation Constraint, $z=1$* | -152.2 | 0.83 | -25 | 22 | 1.5 |
3 | *Weighted Average, $\alpha =0.6$* | -65.1 | 0.78 | -32 | 15 | 2.0 |
3 | *Mean Residual Difference, $\lambda =5000$* | -28.7 | 0.75 | -36 | 12 | 2.2 |
3 | Weighted Average, $\alpha =0.8$ | 16.7 | 0.70 | -42 | 5 | 2.6 |
3 | Net Compensation, $\lambda =5000$ | 37.3 | 0.67 | -47 | 1 | 2.9 |
3 | Mean Residual Difference, $\lambda =1000$ | 38.7 | 0.66 | -48 | 0 | 3.0 |
3 | Net Compensation, $\lambda =1000$ | 43.8 | 0.64 | -52 | -3 | 3.2 |
3 | Mean Residual Difference, $\lambda =100$ | 44.0 | 0.63 | -52 | -4 | 3.2 |
3 | Net Compensation, $\lambda =100$ | 44.1 | 0.63 | -53 | -4 | 3.2 |
3 | OLS | 44.1 | 0.63 | -53 | -4 | 3.2 |

Note: Measures calculated based on cross-validated predicted values and sorted on net compensation. Estimators with negative ${R}^{2}$ values are in italic text.

## Appendix B: Simulated Analysis Data

The IBM MarketScan Research Databases analyzed in Section 3 of the manuscript cannot be distributed online due to their proprietary nature. They also contain protected patient information. Thus, we created a simulated data set with similar properties using key features and relationships from the original data for reproducibility analyses of our code. The simulated analysis data described below and accompanying code to complete the analyses are available online: https://github.com/zinka88/Fair-Regression.

First, we simulated demographic variables, female and age, by sampling from a Bernoulli distribution and truncated Normal distribution with lower bound $a$ and upper bound $b$:

$$\begin{aligned}\text{female}&\sim \text{Bernoulli}(0.52)\\ \text{age}&\sim \text{Normal}(44,12,a=21,b=63).\end{aligned}$$

Next, we generated the 62 binary health variables $\bm{H}=({H}_{1},\mathrm{\dots},{H}_{T})$, each drawn from a Bernoulli distribution and dependent on the demographic variables female and age with coefficients determined by the relationships in the original data. To create the indicator for MHSUD, $A$, we generated 15 binary MHSUD CCS variables $\bm{C}=({C}_{1},\mathrm{\dots},{C}_{15})$ dependent on age, female, and the top six HCCs correlated with MHSUD in the original data. We defined $A=1$ for all observations with at least one MHSUD CCS; 15.7% of observations in the simulated analysis data had MHSUD compared to 13.8% in the original data.

Method | ${R}^{2}$ | Predictive Ratio $g$ | Predictive Ratio ${g}^{c}$ | Net Compensation $g$ | Net Compensation ${g}^{c}$ | Mean Residual Difference | Fair Covariance |
---|---|---|---|---|---|---|---|
Net Compensation${}^{\dagger}$ | 18.5% | 1.001 | 1.000 | \$6 | -\$1 | \$7 | -1 |
Average | 18.6 | 0.999 | 1.000 | -5 | 1 | -7 | 1 |
Covariance | 18.6 | 0.999 | 1.000 | -5 | 1 | -7 | 1 |
Weighted Average${}^{\mp}$ | 19.0 | 0.984 | 1.004 | -106 | 20 | -127 | 17 |
Mean Residual Difference${}^{\oplus}$ | 19.6 | 0.947 | 1.012 | -364 | 68 | -432 | 57 |
OLS | 19.7 | 0.925 | 1.017 | -512 | 95 | -607 | 80 |

${}^{\dagger}\lambda =20000$, ${}^{\mp}\alpha =0.2$, ${}^{\oplus}\lambda =30000$

Note: Measures calculated based on cross-validated predicted values and sorted on net compensation. Best performing hyperparameters for each estimator (with respect to fairness measures) are displayed. Performance for covariance method was the same for all $m$. ${g}^{c}$ is the complement of $g$.

To generate $Y$, we added random noise to an intermediary outcome $\ddot{Y}$ dependent on the input vector $X=\{\text{female},\text{age},\bm{H},\bm{C}\}$. We note that while $\bm{C}$ was used to generate $\ddot{Y}$, it is not used later in the estimation steps as this information is not currently included in risk adjustment formulas. $\ddot{Y}$ was determined using a 2-part model. First, to capture the 10.5% of observations without spending in the original data, we generated whether any spending occurred by creating a binary variable $S$ with $S\sim \text{Bernoulli}({p}_{S})$, where ${p}_{S}={\text{logit}}^{-1}[\mathbf{\Omega}\bm{X}]$ and $\mathbf{\Omega}$ is a vector of coefficients based on the original data. Next, for observations with positive spending, we generated the amount of spending that occurred using a log-linear model of spending dependent on $\bm{X}$ to account for the right-skew of the spending outcome:

$$\ddot{Y}=\begin{cases}0, & \text{if } S=0\\ {e}^{\mathbf{\Phi}\bm{X}}, & \text{if } S=1,\end{cases}$$

where $\mathbf{\Phi}$ is a vector of coefficients based on the original data. Lastly, we sampled from a truncated normal centered around each observation in $\ddot{Y}$ to add noise to the generated outcome:

$${Y}_{k}\sim \text{Normal}({\ddot{Y}}_{k},6000,a=0,b=\max(Y)),$$

where ${Y}_{k}$ is the predicted outcome for observation $k$ in the simulated data. The final simulated spending outcome ranged from $0 to $297,206 with a mean of $5,817 and median of $4,881. The average spending for enrollees with MHSUD was $6,812 versus $5,632 for enrollees without MHSUD. ${R}^{2}$ under OLS was 19.7%.
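The two-part generation of the spending outcome can be sketched as follows. This is an illustration under strong assumptions, not the authors' procedure: the coefficient vectors `OMEGA` and `PHI` are hypothetical placeholders (the real ones come from the original data and also involve $\bm{H}$ and $\bm{C}$), only female and age are used as inputs, truncated normals are drawn by simple rejection sampling, and the upper bound $b$ is taken as the maximum of the intermediary outcome.

```python
import math
import random


def truncated_normal(rng, mean, sd, lower, upper):
    """Rejection sampling from Normal(mean, sd) restricted to [lower, upper]."""
    while True:
        draw = rng.gauss(mean, sd)
        if lower <= draw <= upper:
            return draw


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


# HYPOTHETICAL coefficients (intercept, female, age); placeholders only.
OMEGA = (1.0, 0.4, 0.02)  # part 1: any spending
PHI = (7.0, 0.1, 0.02)    # part 2: log-linear spending amount


def linear(coefs, female, age):
    return coefs[0] + coefs[1] * female + coefs[2] * age


def simulate_intermediate(rng):
    """Draw demographics, then the intermediary outcome from the 2-part model."""
    female = 1 if rng.random() < 0.52 else 0
    age = truncated_normal(rng, 44, 12, 21, 63)
    p_s = inv_logit(linear(OMEGA, female, age))
    s = 1 if rng.random() < p_s else 0  # part 1: did any spending occur?
    # Part 2: exp(linear predictor) when S = 1, and zero otherwise.
    return math.exp(linear(PHI, female, age)) if s == 1 else 0.0


rng = random.Random(7)  # hypothetical seed
y_ddot = [simulate_intermediate(rng) for _ in range(5_000)]

# ASSUMPTION: the upper truncation bound b is the largest intermediary value.
upper = max(y_ddot)
y = [truncated_normal(rng, m, 6000, 0.0, upper) for m in y_ddot]

share_zero = sum(1 for v in y_ddot if v == 0.0) / len(y_ddot)
print(f"share without spending in part 1: {share_zero:.3f}")
print(f"simulated spending range: {min(y):.0f} to {max(y):.0f}")
```

Because the noise step truncates at zero rather than allowing negative draws, observations with ${\ddot{Y}}_{k}=0$ still receive a nonnegative final outcome, so the share of exact zeros after the noise step differs from the share generated in the first part.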

The results from the simulated analysis data are shown in Table B1. As demonstrated in our data analysis presented in Section 3 of the main text, we likewise find that the constrained and penalized estimation methods improve fairness measures without a significant decrease in ${R}^{2}$. The relative rankings of all the methods are similar, although we highlight that net compensation penalized regression performs even more similarly to average constrained and covariance constrained regression methods here.
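The fairness columns of Table B1 are linked by simple identities, which gives a quick internal consistency check. The sketch below assumes the following definitions, which are not restated in this appendix: net compensation is the mean of predicted minus observed spending within a group, mean residual difference is the net compensation for $g$ minus that for ${g}^{c}$, and fair covariance is the covariance between the group indicator $A$ and the residual $Y-\hat{Y}$. Under these assumptions, with $p$ the share of observations in $g$, fair covariance equals $-p(1-p)$ times the mean residual difference.

```python
P_GROUP = 0.157  # share with MHSUD in the simulated analysis data

# (method, net comp. g, net comp. g^c, reported MRD, reported fair covariance)
TABLE_B1 = [
    ("Net Compensation", 6, -1, 7, -1),
    ("Average", -5, 1, -7, 1),
    ("Covariance", -5, 1, -7, 1),
    ("Weighted Average", -106, 20, -127, 17),
    ("Mean Residual Difference", -364, 68, -432, 57),
    ("OLS", -512, 95, -607, 80),
]


def implied_measures(nc_g, nc_gc, p):
    """Mean residual difference and fair covariance implied by the
    group-level net compensation values (assumed definitions, see above)."""
    mrd = nc_g - nc_gc
    fair_cov = -p * (1 - p) * mrd
    return mrd, fair_cov


checks = []
for name, nc_g, nc_gc, mrd_reported, cov_reported in TABLE_B1:
    mrd, fair_cov = implied_measures(nc_g, nc_gc, P_GROUP)
    # Table entries are rounded to whole dollars, so allow a gap of 1.
    ok = abs(mrd - mrd_reported) <= 1 and abs(fair_cov - cov_reported) <= 1
    checks.append(ok)
    print(f"{name}: implied MRD={mrd}, implied covariance={fair_cov:.1f}, ok={ok}")
```

For OLS, for example, the implied values are $-512-95=-607$ and $0.157\times 0.843\times 607\approx 80$, matching the reported entries; the same identity explains why a negative fair covariance corresponds to overcompensation of the group.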

## References

- Ash and Ellis (2012) Ash, AS. and Ellis, RP. (2012), “Risk-adjusted payment and performance assessment for primary care.” Med Care, 50, 643–653.
- Bansal et al. (2014) Bansal, G., Sinha, A., and Zhao, H. (2014), “Tuning Data Mining Methods for Cost-Sensitive Regression: A Study in Loan Charge-Off Forecasting,” Journal of Management Information Systems, 25, 315–336.
- Barocas and Selbst (2016) Barocas, S. and Selbst, AD. (2016), “Big Data’s Disparate Impact,” California Law Review, 104, 671.
- Bechavod and Ligett (2018) Bechavod, Y. and Ligett, K. (2018), “Penalizing Unfairness in Binary Classification,” arXiv pre-print. arxiv.org/abs/1707.00044.
- Bergquist et al. (2018) Bergquist, SL., Layton, TJ., McGuire, TG., and Rose, S. (2018), “Intervening on the Data to Improve the Performance of Health Plan Payment Methods,” NBER Working Paper #24491. nber.org/papers/w24491.
- Berk et al. (2017a) Berk, R., Heidari, H., Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. (2017a), “A Convex Framework for Fair Regression,” arXiv pre-print. arxiv.org/abs/1706.02409.
- Berk et al. (2017b) Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. (2017b), “Fairness in Criminal Justice Risk Assessments: The State of the Art,” arXiv pre-print. arxiv.org/abs/1703.09207.
- Calders et al. (2013) Calders, T., Karim, A., Kamiran, F., Ali, W., and Zhang, X. (2013), “Controlling Attribute Effect in Linear Regression,” in 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA.
- Calmon et al. (2017) Calmon, FP., Wei, D., Ramamurthy, KN., and Varshney, KR. (2017), “Optimized Data Pre-Processing for Discrimination Prevention,” arXiv pre-print. arxiv.org/abs/1704.03354.
- Carey (2017) Carey, C. (2017), “Technological Change and Risk Adjustment: Benefit Design Incentives in Medicare Part D,” American Economic Journal: Economic Policy, 9, 38–73.
- Chouldechova (2017) Chouldechova, A. (2017), “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments,” arXiv pre-print. arxiv.org/abs/1610.07524.
- Chouldechova and Roth (2018) Chouldechova, A. and Roth, A. (2018), “The Frontiers of Fairness in Machine Learning,” arXiv pre-print. arxiv.org/abs/1810.08810.
- Corbett-Davies and Goel (2018) Corbett-Davies, S. and Goel, S. (2018), “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning,” arXiv pre-print. arxiv.org/abs/1808.00023.
- Dwork et al. (2012) Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. (2012), “Fairness through Awareness,” in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS ’12), pp. 214–226, ACM, New York, NY, USA.
- Dwork et al. (2018) Dwork, C., Immorlica, N., Kalai, AT., and Leiserson, M. (2018), “Decoupled classifiers for fair and efficient machine learning,” Proceedings of Machine Learning Research, 81, 119–133.
- Ellis et al. (2018) Ellis, RP., Martins, B., and Rose, S. (2018), “Risk Adjustment for Health Plan Payment,” in Risk Adjustment, Risk Sharing and Premium Regulation in Health Insurance Markets: Theory and Practice, edited by TG. McGuire, and RC. van Kleef. Amsterdam: Elsevier.
- El Mhamdi et al. (2018) El Mhamdi, EM., Guerraoui, R., Hoang, LN., and Maurer, A. (2018), “Removing Algorithmic Discrimination (With Minimal Individual Error),” arXiv pre-print. arxiv.org/abs/1806.02510.
- Ericson et al. (2017) Ericson, KM., Geissler, K., and Lubin, B. (2017), “The Impact of Partial-Year Enrollment on the Accuracy of Risk Adjustment Systems: A Framework and Evidence,” NBER Working Paper #23765. nber.org/papers/w23765.
- Fu et al. (2018) Fu, A., Narasimhan, B., and Boyd, S. (2018), “CVXR: Disciplined Convex Optimization,” web.stanford.edu/~boyd/papers/pdf/cvxr_paper.pdf, online; 30 July 2018.
- Geruso et al. (2017) Geruso, M., Layton, TJ., and Prinz, D. (2017), “Screening in Contract Design: Evidence from the ACA Health insurance exchanges,” NBER Working Paper #22832. nber.org/papers/w22832.
- Gibney (2018) Gibney, E. (2018), “The ethics of computer science: this researcher has a controversial proposal,” nature.com/articles/d41586-018-05791-w, online; 9 August 2018.
- Hardt et al. (2016) Hardt, M., Price, E., and Srebro, N. (2016), “Equality of Opportunity in Supervised Learning,” arXiv pre-print. arxiv.org/abs/1610.02413.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition, New York City: Springer.
- Jacobs and Sommers (2015) Jacobs, DB. and Sommers, BD. (2015), “Using Drugs to Discriminate – Adverse Selection in the Insurance Marketplace,” NEJM, 372, 399–402.
- Johndrow and Lum (2017) Johndrow, JE. and Lum, K. (2017), “An algorithm for removing sensitive information: application to race-independent recidivism prediction,” arXiv pre-print. arxiv.org/abs/1703.04957.
- Kamiran and Calders (2009) Kamiran, F. and Calders, T. (2009), “Classifying without Discrimination,” in 2009 2nd International Conference on Computer, Control and Communication, Karachi, Pakistan.
- Kamishima et al. (2012) Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. (2012), “Fairness-Aware Classifier with Prejudice Remover Regularizer,” in Machine Learning and Knowledge Discovery in Databases, vol. 7524.
- Kautter et al. (2014) Kautter, J., Pope, GC., Ingber, M., Freeman, S., Patterson, L., Cohen, M., and Keenan, P. (2014), “The HHS-HCC Risk Adjustment Model for Individual and Small Group Markets under the Affordable Care Act,” Medicare & Medicaid Research Review, 4, mmrr2014–004–03–a03.
- Kleinberg et al. (2018) Kleinberg, J., Ludwig, J., Mullainathan, S., and Rambachan, A. (2018), “Algorithmic Fairness,” AEA Papers and Proceedings, 108, 22–27.
- Kleinberg et al. (2016) Kleinberg, J., Mullainathan, S., and Raghavan, M. (2016), “Inherent Trade-Offs in the Fair Determination of Risk Scores,” arXiv pre-print. arxiv.org/abs/1609.05807.
- Kusner et al. (2018) Kusner, MJ., Loftus, JR., Russell, C., and Silva, R. (2018), “Counterfactual Fairness,” arXiv pre-print. arxiv.org/abs/1703.06856.
- Layton et al. (2017) Layton, TJ., Ellis, RP., McGuire, TG., and van Kleef, RC. (2017), “Measuring efficiency of health plan payment systems in managed competition health insurance markets,” Journal of Health Economics, 56, 237–255.
- McGuire and van Kleef (2018) McGuire, T. and van Kleef, R. (eds.) (2018), Risk Adjustment, Risk Sharing and Premium Regulation in Health Insurance Markets, Amsterdam: Elsevier.
- McGuire et al. (2013) McGuire, TG., Glazer, J., Newhouse, JP., Normand, SL., Shi, J., Sinaiko, AD., and Zuvekas, SH. (2013), “Integrating Risk Adjustment and Enrollee Premiums in Health Plan Payment,” Journal of Health Economics, 32, 1263–1277.
- Mitchell and Shadlen (2018) Mitchell, S. and Shadlen, J. (2018), “Mirror Mirror: Reflections on Quantitative Fairness,” speak-statistics-to-power.github.io/fairness/, online; 31 July 2018.
- Montz et al. (2016) Montz, E., Layton, TJ., Busch, AB., Ellis, RP., Rose, S., and McGuire, TG. (2016), “Risk adjustment simulation: Plans may have incentives to distort mental health and substance use coverage,” Health Affairs, 35, 1022–1028.
- O’Neil (2017) O’Neil, C. (2017), Weapons of Math Destruction, New York City: Broadway Books.
- Park and Basu (2018) Park, S. and Basu, A. (2018), “Alternative evaluation metrics for risk adjustment methods,” Health Economics, 27, 984–1010.
- Pope et al. (2004) Pope, GC., Kautter, J., Ellis, RP., et al. (2004), “Risk Adjustment for Medicare Capitation Payments Using the CMS-HCC Model,” Health Care Financing Review, 25, 119–141.
- Rose (2016) Rose, S. (2016), “A Machine Learning Framework for Plan Payment Risk Adjustment,” Health Services Research, 51, 2358–2374.
- Rose et al. (2017) Rose, S., Bergquist, SL., and Layton, TJ. (2017), “Computational health economics for identification of unprofitable health care enrollees,” Biostatistics, 18, 682–694.
- Rose and McGuire (2018) Rose, S. and McGuire, TG. (2018), “Limitations of p-values and R-squared for stepwise regression building: A fairness demonstration in health policy risk adjustment,” arXiv pre-print. arxiv.org/abs/1803.05513.
- Shepard (2016) Shepard, M. (2016), “Hospital Network Competition and Adverse Selection: Evidence from the Massachusetts Health Insurance Exchange,” NBER Working Paper #22600. nber.org/papers/w22600.
- Shrestha et al. (2018) Shrestha, A., Bergquist, SL., Montz, E., and Rose, S. (2018), “Mental Health Risk Adjustment with Clinical Categories and Machine Learning,” Health Services Research, 53, 3189–3206.
- van Kleef et al. (2017) van Kleef, RC., McGuire, TG., van Vliet, R., and van de Ven, W. (2017), “Improving risk equalization with constrained regression,” The European Journal of Health Economics, 18, 1137–1156.
- van Kleef et al. (2013) van Kleef, RC., van Vliet, RC., and Van de Ven, WP. (2013), “Risk equalization in The Netherlands: an empirical evaluation,” Expert Rev Pharmacoecon Outcomes Res, 13, 829–839.
- van Kleef et al. (2018) van Kleef, RC., van Vliet, RC., and van de Ven, WP. (2018), “Health plan payment in the Netherlands,” in Risk Adjustment, Risk Sharing and Premium Regulation in Health Insurance Markets: Theory and Practice, edited by TG. McGuire, and RC. van Kleef. Amsterdam: Elsevier.
- Withagen-Koster et al. (2018) Withagen-Koster, AA., van Kleef, RC., and Eijkenaar, F. (2018), “Examining unpriced risk heterogeneity in the Dutch health insurance market,” The European Journal of Health Economics, 19, 1351–1363.
- Zafar et al. (2017a) Zafar, MB., Valera, I., Rodriguez, MG., and Gummadi, KP. (2017a), “Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment,” arXiv pre-print. arxiv.org/abs/1610.08452.
- Zafar et al. (2017b) Zafar, MB., Valera, I., Rodriguez, MG., and Gummadi, KP. (2017b), “Fairness Constraints: Mechanisms for Fair Classification,” arXiv pre-print. arxiv.org/abs/1507.05259.
- Zemel et al. (2013) Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. (2013), “Learning Fair Representations,” in Proceedings of the 30th International Conference on Machine Learning, PMLR, vol. 28, pp. 325–333, Atlanta, Georgia, USA.
- Zliobaite (2015) Zliobaite, I. (2015), “A survey on measuring indirect discrimination in machine learning,” arXiv pre-print. arxiv.org/abs/1511.00148.
- Zliobaite et al. (2011) Zliobaite, I., Kamiran, F., and Calders, T. (2011), “Handling Conditional Discrimination,” in 2011 IEEE 11th International Conference on Data Mining, pp. 992–1001, Vancouver, BC, Canada.