A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning

  • 2019-08-05 17:52:53
  • Lesia Semenova, Cynthia Rudin
  • 1

Abstract

The Rashomon effect occurs when many different explanations exist for thesame phenomenon. In machine learning, Leo Breiman used this term to describeproblems where many accurate-but-different models exist to describe the samedata. In this work, we study how the Rashomon effect can be useful forunderstanding the relationship between training and test performance, and thepossibility that simple-yet-accurate models exist for many problems. Weintroduce the Rashomon set as the set of almost-equally-accurate models for agiven problem, and study its properties and the types of models it couldcontain. We present the Rashomon ratio as a new measure related to simplicityof model classes, which is the ratio of the volume of the set of accuratemodels to the volume of the hypothesis space; the Rashomon ratio is differentfrom standard complexity measures from statistical learning theory. For ahierarchy of hypothesis spaces, the Rashomon ratio can help modelers tonavigate the trade-off between simplicity and accuracy in a surprising way. Inparticular, we find empirically that a plot of empirical risk vs. Rashomonratio forms a characteristic $\Gamma$-shaped Rashomon curve, whose elbow seemsto be a reliable model selection criterion. When the Rashomon set is large,models that are accurate - but that also have various other useful properties -can often be obtained. These models might obey various constraints such asinterpretability, fairness, monotonicity, and computational benefits.

 

Quick Read (beta)

A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning

Lesia Semenova Department of Computer Science, Duke University Cynthia Rudin Department of Computer Science, Duke University
\xpatchcmd
Proof.
\@addpunct

.\@addpunct: \pgfplotssetcompat=newest

The Rashomon effect occurs when many different explanations exist for the same phenomenon. In machine learning, Leo Breiman used this term to describe problems where many accurate-but-different models exist to describe the same data. In this work, we study how the Rashomon effect can be useful for understanding the relationship between training and test performance, and the possibility that simple-yet-accurate models exist for many problems. We introduce the Rashomon set as the set of almost-equally-accurate models for a given problem, and study its properties and the types of models it could contain. We present the Rashomon ratio as a new measure related to simplicity of model classes, which is the ratio of the volume of the set of accurate models to the volume of the hypothesis space; the Rashomon ratio is different from standard complexity measures from statistical learning theory. For a hierarchy of hypothesis spaces, the Rashomon ratio can help modelers to navigate the trade-off between simplicity and accuracy in a surprising way. In particular, we find empirically that a plot of empirical risk vs. Rashomon ratio forms a characteristic Γ-shaped Rashomon curve, whose elbow seems to be a reliable model selection criterion. When the Rashomon set is large, models that are accurate – but that also have various other useful properties – can often be obtained. These models might obey various constraints such as interpretability, fairness, monotonicity, and computational benefits.

1 Introduction

The 1950 Kurosawa film “Rashomon”[26] revolves around four characters describing entirely different perspectives on the same crimes. Based on this idea that there could be many seemingly accurate descriptions of the same data, Leo Breiman [6] coined the term “Rashomon effect” to describe cases when there exist many different approximately-equally accurate models. He noticed that this effect happens very often; there is no “best” model from most finite datasets, only many good explanations. Of course, in machine learning, our goal is not necessarily to find out the truth; it is to predict well out-of-sample.

Decades of study about generalization in machine learning have provided many different mathematical theories. Many of them measure the complexity of classes of functions without considering the data (e.g., VC theory [44]), or measure properties of specific algorithms (e.g., algorithmic stability [5]). However, none of these theories seems to directly capture a phenomenon that occurs throughout practical machine learning. In particular, there are many datasets for which many standard machine learning algorithms perform similarly. In these cases, the machine learning models tend to generalize well. Furthermore, in these same cases, there is often a simpler model that performs similarly and also generalizes well. Perhaps the Rashomon effect can help us to explain that phenomenon.

In this work, we aim to quantify the Rashomon effect, show that it has implications for generalization, and show it has implications for the existence of simpler models. If the Rashomon effect is large, it means that there exists a large number of models that perform approximately-equally-well on the training data – that is, there exist many differing explanations for what we observe in the data. We quantify the size of the Rashomon effect through the Rashomon volume, which is the size of the set of models that performs almost equally well on the training data. The Rashomon set is defined as the set of these well-performing models. An illustration of the Rashomon set is shown in Figure 1. The Rashomon ratio is a ratio of the Rashomon volume to the volume of a hypothesis space. In this manuscript, we explore the connections between the Rashomon volume, Rashomon ratio, hierarchies of hypothesis classes, training performance and generalization.

(a) From the side
(b) From above
(c) From underneath
Figure 1: An illustration of a possible Rashomon set in two dimensional hypothesis space . Red plane represents the Rashomon parameter θ. Models below the plane belong to the Rashomon set R^set(,θ).

When the Rashomon set is large, it is reasonable to assume that models with various properties can exist inside of it. For example, sparser, more transparent models, models that are easier to compute in practice, models that obey domain-specific constraints such as monotonicity along a given set of features, models that obey fairness constraints, and models that rely more on features that we can trust – these models may all live within a single large Rashomon set. While proving the existence of an interpretable model in the Rashomon set is a challenging practical problem in general (without actually solving for such a model directly, which can also be practically hard), we show a few specific results to this end. Specifically, there are assumptions that allow us to show existence of a smaller hypothesis set of simpler models within the Rashomon set. If the assumptions are satisfied, a model from a simpler class is approximately as accurate as the most accurate models within the hypothesis space, which consequentially leads to better generalization guarantees. The assumptions are based in approximation theory, which models how one class of functions can approximate another.

The Rashomon ratio can serve as a gauge of simplicity for a learning problem, though we use it in a different context than in standard learning theory. As a property of both a dataset and a hypothesis class, it differs from the VC dimension [45] (as Rashomon ratio is specific to a dataset), it differs from algorithmic stability [5] (as the Rashomon ratio does not rely on robustness of an algorithm with respect to changes in the data), it differs from local Rademacher complexity [3] (as the Rashomon ratio does not measure the ability of the hypothesis space to handle random changes in targets and actually benefits from multiple similar models), and it differs from geometric margins [44] (as one can have a small margin with a large Rashomon ratio, and margins are measured with respect to one model, whereas the Rashomon ratio considers the existence of many). The Rashomon set is not simply a flat minimum; it could consist of many non-flat local minima, and it works for discrete hypothesis spaces where gradients, and thus “sharpness” do not exist. We provide basic generalization bounds showing how the Rashomon ratio gauges the existence of simpler-yet-accurate solutions that generalize well.

We empirically observe a trend across different datasets between the Rashomon ratio and empirical risk when considering a hierarchy of hypothesis classes (in particular, decision trees with increasing depth, or linear models with increasing model size). This trend is in the form of a Γ-shaped curve, which is formed as we move up the hierarchy to increase the size of the hypothesis class. Specifically, as the size of the hypothesis class increases, first the empirical risk of the best model in the class decreases, and the Rashomon ratio grows or remains approximately the same; then the empirical risk of the best model in the class is relatively constant, but the Rashomon ratio decreases. This trend, which we call the Rashomon curve, occurs for at least the 36 classification datasets we downloaded from the UCI Machine Learning Repository [16] for decision tree classifiers of various depths, and linear models of various sizes.

Empirically, we find that the turning point in the Γ-shaped Rashomon curve, which can be formulated as a simple minimax optimization problem, is often a good choice for model selection. This Rashomon elbow balances between a low complexity (or a smaller size) hypothesis space (corresponding to a small Rashomon volume) and a low empirical risk (corresponding to a high accuracy). We show the implication of using the Rashomon elbow to choose the complexity of a hypothesis space to balance accuracy with generalization.

We also discuss different methods of measuring the Rashomon volume and the Rashomon ratio. For multiple linear regression, we derive a closed form solution for the Rashomon volume in parameter space. When the Rashomon set is bounded and convex in parameter space, we design a separating oracle and use known randomized algorithm guarantees. In the more general case, we propose to use rejection sampling and importance sampling methods in the hypothesis space for estimation of the Rashomon ratio, and use these for our experiments.

Our results have implications beyond those where the size of the Rashomon volume can be estimated in practice. In particular, our results indicate that when many machine learning methods perform similarly on the same dataset (without overfitting), it could be because the Rashomon set of the functions these algorithms consider is large. In that case, it may be worthwhile to find models within the Rashomon set that have desirable properties, such as interpretability. Interpretability is a type of constraint, and since it is harder to optimize for constrained models than to simply run several different standard machine learning methods, it is worthwhile to run the standard machine learning methods first. This would allow the user to gauge whether the constrained optimization is likely to yield a model that is approximately as accurate as the standard methods.

We summarize the contributions of this work as follows: (i) We define the Rashomon volume and the corresponding Rashomon ratio as important characteristics of the Rashomon set. (ii) We provide generalization bounds for models from the Rashomon set, and show that the size of the Rashomon set serves as a barometer for the existence of accurate-yet-simpler models that generalize well. These are different than standard learning theory bounds that consider the distance between the true and empirical risks for the same function. (iii) We illustrate the Γ-shaped Rashomon curve on many datasets, and show that the Rashomon elbow can be a useful choice for model selection. (iv) We show empirically that in cases when a large Rashomon set occurs, most machine learning methods tend to perform similarly, and also in these cases, interpretable or sparse (yet accurate) models exist. (v) We provide several approaches for estimating the size of the Rashomon set. (vi) We demonstrate that the Rashomon ratio, as a gauge of simplicity of a machine learning problem, is different from other known complexity measures such as VC-dimension, algorithmic stability, geometric margin, and Rademacher complexity. It is also different from flat minima.

2 Related work

Rashomon sets: Rashomon sets have been used for various purposes [40, 18, 13], though the concept has been given different names. For instance, Srebro et al. [40] consider a loss class of close-to-optimal models, and with an assumption of H-smoothness of a loss function, they obtain a tighter excess risk bound through local Rademacher complexity [3]. Our bounds do not work the same way and aim to prove a different type of result. Other works aim to search through the Rashomon set to find the most extreme models within it, rather than looking at the size of the Rashomon set, as we do in this work. For instance, Fisher et al.[18] and Coker et al. [13] leverage the Rashomon set in order to understand the spectrum of variable importance and other statistics across the set of good models. Our work considers the existence of models from simpler classes rather than exploring the Rashomon set to find a range of variable importance or other statistics. The work of Fisher et al. grew out of work by Tulabandhula and Rudin [42, 43] that uses the Rashomon set to assist with decision making, by finding the range of downstream operational costs associated with the Rashomon set. Rashomon sets are related to p-hacking and robustness of estimation, because the Rashomon set is a set over which one might conduct a sensitivity analysis to choices made by an analyst [13].

Flat minima or wide valleys: The concept of flat minima (wide valleys) has been explored in the deep learning literature as a possible way to understand convergence properties of the complicated, non-convex loss functions that deep networks traverse during training [21, 17, 24, 11, 24]. Based on a minimum-message-length argument [46], several works claim that flat loss functions lead to better generalization due to a robustness to the noise around the minimum [21, 24, 11]. Following Hochreiter and Schmidhuber [21], Dinh et al. [17] define ϵ-flatness in a similar manner to our definition of the Rashomon set, except for a few important differences. In particular, our Rashomon set is defined over the hypothesis space, while ϵ-flatness is defined in a parameter space (though sometimes we use parameter space for ease of computation), and the Rashomon set is not necessarily a single connected component (although it might be in the case of a convex loss over a continuous domain), while ϵ-flatness pertains only to a connected set. This means that the Rashomon set can contain models from different local minima, or can be defined on discrete spaces, while ϵ-flatness is relevant only for continuous loss functions. Another way of quantifying flatness is σ-sharpness [24, 17], which measures the change of the loss function inside a σ-ball in a parameter space. In the case of a connected Rashomon set, this loss difference corresponds to the Rashomon parameter θ.

Statistical learning theory: Numerous works provide generalization bounds based on different complexity measures, and under different assumptions. Some of them include Rademacher [40, 22] and Gaussian complexities [22], PAC-Bayes theorems [27], covering numbers bounds [49], and margin bounds [45, 38, 25], etc. In contrast, the Rashomon ratio provides a certificate of the existence of a simpler model that generalizes, rather than acting itself as a simplicity measure. The use of approximating sets, as used extensively in this paper, is used throughout the literature on learning theory [28, 31, 38, 33]. An excellent example of this is the classical generalization bound for boosting and margins [38], which uses combinations of several random draws of base classifiers to represent combinations of base classifiers. This is an instance of the so-called “Maurey’s lemma,” which often provides this approximating set for linear model classes.

Model selection: The closest model selection literature to our work [30, 39] focuses on separately estimating the complexity of each hypothesis space within a hierarchy. The complexity of each member of the hierarchy is measured through VC-dimension [39] or empirical covers [30]; once a member of the hierarchy is chosen (a specific hypothesis space), they would choose a model that minimizes risk within the hypothesis space. Our work also chooses models that minimize the risk in each hypothesis space, and provides a specific guideline on how to choose the hypothesis space using the Rashomon elbow.

We know of no other works that illustrate the Rashomon curve, which we have observed universally among datasets we have considered.

3 Rashomon set definitions and notation

Consider a training set of n data points S={z1,z2,,zn}, zi=(xi,yi) drawn i.i.d. from an unknown distribution 𝒟 on a bounded set 𝒵=𝒳×𝒴, where 𝒳p and 𝒴 are an input and an output space respectively. Our hypothesis class is ={f:𝒳𝒴}. There can be a prior distribution ρ over functions, often it is uniform. To measure the quality of a prediction made by a hypothesis, we use a loss function ϕ:𝒴×𝒴+. Specifically, for each given point z=(x,y) and a hypothesis f, the loss function is ϕ(f(x),y). For a given f we will also overload notation by writing l:×𝒵+ that takes f explicitly as an argument: l(f,z)=ϕ(f(x),y). We are interested in learning a model f that minimizes the true risk L(f)=𝔼z𝒟[ϕ(f(x),y)], which depends on an unknown distribution 𝒟 and therefore is estimated with an empirical risk: L^(f)=1ni=1nϕ(f(xi),yi). For the rest of this paper, data are drawn from the unknown distribution 𝒟 on 𝒳×𝒴, unless otherwise specified.

3.1 Basic Rashomon set and ratio definition

We define the empirical Rashomon set (or simply Rashomon set) as a subset of models of the hypothesis space that have training performance close to the best model in the class, according to a loss function. More precisely:

Definition 3.1.

(Rashomon set) Given θ0, a dataset S, a hypothesis class , and a loss function ϕ, the Rashomon set R^set(,θ) is the subspace of the hypothesis space defined as follows:

R^set(,θ):={f:L^(f)L^(f^)+θ},

where f^ is an empirical risk minimizer for the training data S with respect to a loss function ϕ: f^argminfL^(f).

The hypothesis class in the definition of the Rashomon set can be a specific well-defined hypothesis class, such as the space of decision trees of depth D or neural nets with D hidden layers, or it can be a more general space (a meta-hypothesis space) that contains models from different hypothesis classes (e.g., linear functions, polynomials up to degree D, and piece-wise constant functions) with the training error as a loss function.

We call θ the Rashomon parameter. To compute the size of the Rashomon set we will use the Rashomon volume 𝒱(R^set(,θ)), where 𝒱():2 measures the volume of a set in the hypothesis space, potentially weighted by the prior ρ over functions. If the hypothesis space is discrete then the Rashomon volume can be calculated by counting how many models are inside R^set(,θ) or, in other words, by computing the cardinality of the Rashomon set directly. In the continuous case we assume that the Rashomon set is bounded and therefore 𝒱(R^set(,θ))<.

When we compare the size of the Rashomon set for different hypothesis spaces, the Rashomon volume might not be an informative measure, especially if the hypothesis spaces we consider have very different complexity. To normalize the Rashomon volume, we introduce the Rashomon ratio, which uses the hypothesis space to form the denominator of the ratio. In order to keep the denominator from containing functions that are irrelevant (and thus increase the denominator without a chance of increasing the numerator), we limit the hypothesis space to contain only models that vary within the bounded domain 𝒵 where the data reside. We will assume that the hypothesis space is bounded and 𝒱()<. We define the Rashomon ratio as follows:

Definition 3.2.

Let be a hypothesis space given a dataset S. The Rashomon ratio is a ratio of the volume of models inside the Rashomon set R^set(,θ) to the volume of models in the hypothesis space :

R^ratio(,θ)=𝒱(R^set(,θ))𝒱(). (1)

By our definitions, this ratio is always between 0 and 1. It represents the fraction of models that are good (the fraction of models that fit the data about equally well). If θ is small, a larger the Rashomon ratio implies that more models perform about equally well. When θ is large enough, the Rashomon set contains all models in and the ratio is 1.

If we want to specify the dataset S that is used to compute the Rashomon set, we indicate the dataset in the subscript, as R^setS(,θ) and R^ratioS(,θ).

In the definitions above, the Rashomon set considered multiplicity of models. In the definition in the next subsection, we consider multiplicity of predictions instead. Whereas in Definition 3.1, two models that are different but make the same predictions would be considered different, these two models would join the same equivalence class in the definition of the pattern Rashomon ratio.

3.2 Pattern Rashomon ratio for binary classification

For a binary classification problem on dataset S,

Definition 3.3.

Given a hypothesis space , dataset S, and a binary-valued function ζ:×𝒵n{0,1} (which is usually either sign(f(xi)) or the loss 1[sign(f(xi)yi]), the pattern Rashomon ratio is the ratio of possible realizations of ζ vectors from functions within the Rashomon set. Denote 𝜻(f,S)=[ζ(f,z1),ζ(f,z2),,ζ(f,zn)]. Also denote binary(i) as the vectorized binary representation of i of size n (e.g., binary(1)=[0,0,0,0,1] when n=5). The pattern Rashomon ratio is:

R^ratiopat(,θ)=j=02n-1𝟙[fR^setS:𝜻(f,S)=binary(j)]j=02n-1𝟙[f:𝜻(f,S)=binary(j)].

Note that this ratio is different from the Rashomon ratio defined in (1) as a multiplicity of models. The pattern Rashomon ratio measures the diversity of predictions made by functions in the Rashomon set compared to the diversity of prediction of functions within the hypothesis space. If the pattern Rashomon ratio is high, it means that the Rashomon set contains not only multiple models, but also multiple models with different properties, depending on the definition of ζ.

There are cases where the pattern Rashomon ratio and the regular Rashomon ratio are the same; for example, in the case of binary classification with binary hypotheses and the sign performance function ζ(f,z)=sign(f(x)).

The pattern Rashomon ratio has useful approximation guarantees. In particular as the size of the model space D grows to be infinitely large (e.g., the depth of the decision tree grows infinitely, or number of parameters grows to infinity), the pattern Rashomon ratio approaches a fixed value that depends on the Rashomon parameter and number of points in the dataset only. This intuition is summarized in the next statement.

Statement 3.1 (Approximation guarantees for the pattern Rashomon ratio).

Let D represent the size of the hypothesis space F. For binary classification and sign performance function ζ(f,z)=sign(f(x)), as D, the pattern Rashomon ratio R^ratiopat(F,θ)R¯pat=iθn(ni)2n, and for θ1/2: 2n(H(θ)-1)8nθ(1-θ)R¯pat2n(H(θ)-1), where n is the size of the training dataset, and H(θ)=-θlog2θ-(1-θ)log2(1-θ) is the binary entropy.

In contrast, there is no obvious limit value for the Rashomon ratio. There exist data distributions such that for a fixed value θ, as the size of the hypothesis space grows, the Rashomon ratio will converge to 0. There also exist data distributions such that the Rashomon ratio may not converge to either zero or one. For example, separable data with a large margin may lead to a limiting Rashomon ratio that is greater than zero.

The Rashomon ratio and the pattern Rashomon ratio, as properties of a dataset and a hypothesis class, serve as indicators of the simplicity of the learning problem. A large Rashomon ratio means that the Rashomon set contains multiple models within the hypothesis space that have approximately constant empirical risk. These accurate models could either come from multiple local minima or from one wide local minimum. In this case, potentially every optimization procedure could lead to a hypothesis from the Rashomon set. In this way, for large Rashomon sets, accurate models tend to be easier to find, because optimization procedures can find them. In other words, if the Rashomon ratio is large, the Rashomon set could contain many accurate and simple models, and the learning problem becomes simpler in general. On the other hand, smaller Rashomon ratios might imply a harder learning problem, especially in the case of few deep and narrow local minima.

We introduce one more variation of the Rashomon set, where the threshold that bounds the risk does not depend on the empirical risk minimizer. Given a parameter γ0, we call the Rashomon set with restricted empirical risk an anchored Rashomon set:

R^setanc(,γ)={f:L^(f)γ}.

We define also the true anchored Rashomon set based on the true risk as follows:

Rsetanc(,γ)={f:L(f)γ}.

We will denote R^ratioanc(,γ) and Rratioanc(,γ) as the Rashomon ratios computed on the anchored Rashomon set and the true anchored Rashomon set. We could choose γ so that this definition mirrors Definition 3.1, so that the true risk is restricted by quantity γ=L(f*)+θ, where f* is a model that minimizes the true risk.

The true anchored Rashomon set, as it turns out, is sometimes very useful in practice, but there is a caveat, which is that one does not know a priori whether or not it will be useful for a particular problem. In fact, one does not even know afterwards whether the true anchored Rashomon set helped us for a particular problem, even after we had solved the optimization problem to get a model. We explain this in the next section, and then spend most of our effort considering (empirical) Rashomon sets, which are easier to work with in practice. In the next section, we discuss the simplicity and generalization properties of models that are in the Rashomon set.

4 Rashomon set models: simplicity and generalization

Building on the discussions from the previous section, let us consider two function classes with different levels of complexity, where the lower-complexity class serves as a good approximating set for the higher-complexity class. The function classes are called 1, for the simpler class, and 2, for the more complex class. Generalization bounds would be tighter if we could use the lower complexity class 1, but if we are considering functions from 2, learning theory often has us consider complexity of 2. We have several questions to answer:

  1. 1.

    What if the higher-complexity hypothesis space we chose was more complex than necessary for modeling the data? In that case, can we still have guarantees on test performance of the best classifier in the complex space 2 that leverage only complexity of the lower-complexity class 1? In particular, perhaps special properties of the more complex class can help our analysis; if these properties hold, then we can get all the guarantees we need about the best possible attainable test performance on the more complex class 2 by looking only at optimal training performance on the less complex class 1. (As a preview to Section 4.1 where we answer this question, the property on the complex class that will help us is that the true anchored Rashomon set is large. However, we cannot know in practice when this property holds, which is a disadvantage of this analysis. When the property holds, even if we do not know it holds, then this property still becomes helpful in practice.)

  2. 2.

    Before looking at the data, let us assume we chose to examine the more complex hypothesis space 2, not knowing that the lower-complexity class 1 would suffice. We may have done this for computational reasons, since perhaps it is much easier to optimize over the higher-complexity class than the lower-complexity class. For instance, perhaps the lower-complexity class is decision trees with limited depth, which are hard to optimize over, whereas the larger class considers support vector machines, which are much easier to optimize. Our question is, after having done some analysis on the higher complexity class, can we still have generalization guarantees that use only the complexity of the lower-complexity class? Specifically, can we guarantee the existence of a simple-yet-accurate model (a model from the lower-complexity space with low training error) that will generalize well? Can we guarantee that many such simple-yet-accurate models exist? (As a preview to Section 4.2, we will use smoothness and the size of the Rashomon set as key tools to prove the existence of simple-yet-accurate models.)

  3. 3.

    Again let us assume we chose to examine the more complex hypothesis space, 2. If the Rashomon set of 2 is large, can we use this information to get a better generalization guarantee for all functions f22 that we might find during our analysis? In particular, if the Rashomon set of 2 is large, and 1 serves as a good approximating set for 2, then can we get a generalization bound for all f22 that uses the complexity of 1? If so, this indicates that the learning problem is not actually as complex as it might seem if we had only looked at the more complex class of functions 2. In Section 4.3 we will again use smoothness and the size of the Rashomon set to create this bound.

In creating these bounds, we hoped to distill the problems to their essence, so that the bounds are as close as possible to basic Occham’s razor bounds. These bounds, including those for discrete hypothesis spaces can be generalized to more complex statistical learning theory analyses if desired. Note that most of our bounds do not serve the same purpose as standard statistical learning theoretic bounds, in that they do not necessarily aim to bound generalization error for a single function. Standard learning theory analysis handles this question nicely; we are concerned with other questions here.

4.1 The true anchored Rashomon set can be very helpful… but you might not know when

As in classic Occham’s razor bounds, we start with finite hypothesis spaces. Let us consider finite hypothesis spaces 1 and 2, where 2 has more functions (and thus a larger complexity) than 1, and 12. Let us consider the first question discussed above: Given 1 and 2, can we have guarantees on the best possible test performance of models in 2, which depend only on 1’s complexity and empirical performance? In the following two theorems, we will make a key assumption that allows us to do this: we assume a sufficiently large anchored true Rashomon set for 2. Specifically, we assume that there are a large number of functions from 2 that have true risk below γ. This large number of functions is assumed to be large enough to contain at least one function from 1. In that case, this special function is likely to generalize (due to learning theory results on 1). This is how we will be able to analyze the best possible true risk from 2 in terms of what we can observe from 1 on the data.

Here, || denotes the cardinality of the finite space . These bounds can be generalized to infinite hypothesis spaces, but they are designed for intuition, which works nicely with finite hypothesis spaces.

Theorem 4.1 (The advantage of a large true anchored Rashomon set on discrete hypothesis spaces I).

Consider finite hypothesis spaces F1 and F2, such that F1F2. Let the loss l be bounded by b, l(f2,z)[0,b]f2F2,zZ. Define an optimal true function f2*argminf2F2L(f2). Let us assume that the true anchored Rashomon set is large enough to include a function from F1, so there exists a model f~1F1 such that f~1Rsetanc(F2,γ). In that case, for any ϵ>0 with probability at least 1-ϵ with respect to the random draw of data:

|L(f2*)-L^(f^1)|γ+2blog|1|+log2/ϵ2n,

where f^1argminf1F1L^(f1).

That is, when our assumption on the geometry of the true risk landscape is correct, we can approximate the best possible population model from 2 with the best empirical model within 1, and the bound depends only on the complexity of 1 and not 2.

The main assumption (of a sufficiently large anchored true Rashomon set) is an assumption about the population, and does not rely on the sample. It relies only on the existence of one special function in the true anchored Rashomon set. There are no smoothness assumptions on the loss function, nor any other major assumptions about the relationship between 1 and 2 other than that 12. If the assumption holds, then we gain the benefit of guarantees on 2 from looking only at 1 empirically. We cannot check whether the assumption holds since it involves the true risk, but practitioners can reap the benefits of it anyway: if their algorithm chooses a function from 1 in practice when minimizing among functions from 2, they may be unknowingly achieving a tighter guarantee on test error.

Let us make the connection of this result to Rashomon sets more explicit. If 1 is a random sample of functions from 2, and if 2 has a large true anchored Rashomon set, then 1 is likely to include at least one model from the true anchored Rashomon set. In that case, Theorem 4.1 above applies. Conversely, if 2 has a small true anchored Rashomon set, 1 is unlikely to contain a model from the true anchored Rashomon set, in which case, Theorem 4.1 does not apply, and there is no guarantee. Let us write this formally.

Theorem 4.2 (The advantage of a large true anchored Rashomon set on discrete hypothesis spaces II - random sampling).

Consider finite hypothesis spaces F1 and F2, such that F1F2 and F1 is uniformly drawn from F2 without replacement. Define an optimal true function f2*argminf2F2L(f2). For a loss l bounded by b and any ϵ>0, with probability at least (1-ϵ)p with respect to the random draw of functions from F2 to form F1 and with respect to the random draw of data:

|L(f2*)-L^(f^1)|γ+2blog|1|+log2/ϵ2n, (2)

where p=1-((1-Rratioanc(F2,γ))|F2||F1|)(|F2||F1|)=1-i=1|Rsetanc(F2,γ)|(1-|F1||F2|-|Rsetanc(F2,γ)|+i), and f^1argminf1F1L^(f1).

Please refer to Table 1 for lower bounds of |1| from Theorem 4.2, given values of |2| and R^ratio(2,γ).

If |2|=100000 and R^ratio(2,γ)0.1% then
          if |1|5156 then with probability at least 99% the bound (2) holds.
If |2|=100000 and R^ratio(2,γ)0.5% then
          if |1|1051 then with probability at least 99% the bound (2) holds.
If |2|=100000 and R^ratio(2,γ)1% then
          if |1|526 then with probability at least 99% the bound (2) holds.
If |2|=100000 and R^ratio(2,γ)2% then
          if |1|262 then with probability at least 99% the bound (2) holds.
If |2|=100000 and R^ratio(2,γ)5% then
          if |1|104 then with probability at least 99% the bound (2) holds.
Table 1: Examples of the possible usage of Theorem 4.2.

Theorem 4.2 holds with a probability that depends on the likelihood of drawing 1 from 2. If |1| is small compared with |2|, the probability p in the theorem statement is small as well. In the following lemma, we provide a lower bound on the Rashomon ratio so that the bound in Theorem 4.2 holds with a given value of probability p=ϵ. As a key step of the proof of the lemma, we will lower bound the probability of sampling without replacement with the probability of sampling with replacement. Please see the details of the proof in Appendix D.3.

Lemma 4.2.1.

For a finite hypothesis space F2 of size |F2|, we will draw |F1| functions uniformly without replacement from F2 to form F1. If the true anchored Rashomon ratio of the hypothesis space F2 is at least

Rratioanc(2,γ)1-ϵ1|1|

then with probability at least 1-ϵ with respect to the random draw of functions from F2 to form F1, the Rashomon set contains at least one model f~1 from F1.

Combining Lemma 4.2.1 and Theorem 4.1 we get the following statement:

Theorem 4.3 (The advantage of a large true anchored Rashomon set on discrete hypothesis spaces III - reduction).

Consider finite hypothesis spaces F1 and F2, such that F1F2 and F1 is uniformly drawn from F2 without replacement. For a loss l bounded by b if the Rashomon ratio is at least

Rratioanc(2,γ)1-ϵ1|1|

then for any ϵ>0, with probability at least (1-ϵ)2 with respect to the random draw of functions from F2 to form F1 and with respect to the random draw of data:

|L(f2*)-L^(f^1)|γ+2blog|1|+log2/ϵ2n,

where f2*argminf2F2L(f2), and f^1argminf1F1L^(f1).

If |F1|=100000 then to get the bound (2) to hold with probability at least 99%
          the Rashomon ratio should be R^ratio(2,γ)(5.026×10-8)%.
If |F1|=10000 then to get the bound (2) to hold with probability at least 99%
          the Rashomon ratio should be R^ratio(2,γ)(5.026×10-7)%.
If |F1|=1000 then to get the bound (2) to hold with probability at least 99%
          the Rashomon ratio should be R^ratio(2,γ)(5.026×10-6)%.
Table 2: Examples of the possible usage of Theorem 4.3.

Please refer to Table 2 for possible values of the lower bound on the Rashomon ratio, given |1| and ϵ.

Note that if each model from the Rashomon set has different performance according to the performance function ζ, than Theorem 4.2 and 4.3 apply to the pattern Rashomon ratio as well.

Theorems 4.2 and 4.3 guarantee that if the true anchored Rashomon set is sufficiently large, then with high probability, the best empirical risk over the simpler class 1 is close to the best possible true risk over the larger class 2. The generalization guarantee comes from the size of the simpler class 1.

The bound shows directly how the size of the Rashomon set could potentially impact generalization guarantees, even if 1 is not derived from 2 by sampling as in the theorem. In particular, as the true anchored Rashomon ratio increases, the chance increases that the empirical risk minimizer of 1 and true minimizer of 2 will be close together.

Theorems 4.1, 4.2, and 4.3 show the explicit benefits of simpler models, however, they have caveats as discussed earlier. As discussed, the theorems’ assumption is unverifiable. Even if it holds, we cannot check it, as we cannot actually compute the true anchored Rashomon set in practice. However, there might be cases when empirical and true Rashomon sets are close, and therefore it is beneficial to know the properties of one to understand of the properties of the other. In particular, with high probability, if a fixed model is contained within the anchored Rashomon set, it also belongs to slightly larger true anchored Rashomon set. The reverse statement holds as well.

Statement 4.1 (Empirical and true anchored Rashomon sets).

For a loss l bounded by b and for any ϵ>0, if fR^setanc(F,γ) then with probability at least 1-e-2n(ϵ/b)2 with respect to the random draw of training data, fRsetanc(F,γ+ϵ).

An analogous statement holds for a model from a true anchored Rashomon set:

Statement 4.2 (True and empirical anchored Rashomon sets).

For a loss l bounded by b and for any ϵ>0, if fRsetanc(F,γ) then with probability at least 1-e-2n(ϵ/b)2 with respect to the random draw of training data,

fR^setanc(,γ+ϵ).

Statements 4.1 and 4.2 show that, for a fixed model, with high probability, its membership in the true anchored and the anchored Rashomon sets are closely related. Therefore, when we consider a given dataset S to compute a model in the Rashomon set, we can infer that this model is likely to belong to a related true Rashomon set.

When the anchored and the true anchored Rashomon sets are not close, it most likely means that the anchored Rashomon set contains overfitted models. These models would not be contained in the true anchored Rashomon set. Because of this, the generalization bounds for models from the true anchored Rashomon set are much tighter than those of Section 4.3 later.

Theorems 4.1, 4.2, 4.3 in this section do not take advantage of the fact that we can investigate 2 empirically, and more easily than we can investigate 1; these theorems instead only discuss the exploration of 1. In what follows, we will study empirical Rashomon sets. Because we are studying empirical Rashomon sets, and because we will work with 2 instead of 1, we will need some mechanism to approximate 2 in terms of 1 and to ensure that functions from 2 generalize; for these purposes, we will use smoothness of the loss over function space.

4.2 Existence of simple-yet-accurate models with good generalization

As discussed above, we will now consider empirical Rashomon sets, rather than true Rashomon sets, for the reasons discussed above: the assumptions in the earlier theorems are unverifiable, and rely on optimization of the less complex function class 1 rather than the more complex function class 2. If optimization over 2 is easier (as it is less constrained), we may want to optimize over 2 first, and be guaranteed the existence of at least one function in 1 based on what we observe with 2. Thus, what we aim to prove in this section is the existence of functions in 1 that are in the Rashomon set of 2. In order to do this using an approximating set argument, we use more assumptions than in the previous section, specifically smoothness. The following theorem shows that, under certain conditions, if there is a function close to 2’s minimizer in function space that is also in 1, then it is a function we would be looking for: it is also in the Rashomon set of 2 and it generalizes.

For a hypothesis space and some f let us define the δ-ball of functions centered at f as Bδ(f)={f:f-fpδ}. A loss l:×𝒳𝒴 is said to be K-Lipschitz, K0, if for all f1,f2 and for all z𝒵: |l(f1,z)-l(f2,z)|Kf1-f2p.

Theorem 4.4 (Existence of a simpler-but-accurate model, and its generalization I).

For K-Lipschitz loss l bounded by b consider hypothesis spaces F1 and F2 such that F1F2. If there exists f1¯F1 such that f2^-f1¯pθK, where f2^ is the empirical risk minimizer within F2, then for a fixed parameter ϵ(0,1):

  1. 1.

    f1¯ is in the Rashomon set R^set(2,θ).

  2. 2.

    With probability greater than 1-ϵ w.r.t. the random draw of training data,

    |L(f1¯)-L^(f1¯)|2KRn(1)+blog(2/ϵ)2n,

    where Rn() is the standard Rademacher complexity of a function class .

In the case of one local minimum, if that minimum contains many models that perform approximately the same, then this minimum yields a large Rashomon set. This minimum could be wide, or alternatively the loss could be somewhat flat with a bounded derivative. These are often cases where the loss is locally Lipschitz continuous on the Rashomon set with a small Lipschitz constant K, which would create a tighter bound.

Theorem 4.4 also illustrates how we can use a known Lipschitz continuity to choose a Rashomon parameter θ. In particular, if we would like to consider a δ-ball in the hypothesis space f^-fp<δ, then we would choose θ=Kδ, which would keep the bound as small as possible but still permit the result to hold.

Theorem 4.5 below is a variation on Theorem 4.4. It still uses the approximating set argument, but also requires the Rashomon set to be large enough to include a ball of functions. As long as the set of simpler functions is distributed well among the full function class, the ball contains at least one function from the simpler class.

Theorem 4.5 (Existence of a simpler-but-accurate model, and its generalization II).

For a K-Lipschitz loss l bounded by b consider hypothesis spaces F1 and F2 such that F1F2 and for every model f2R^set(F2,θ) there exists a model f1F1 such that f2-f1pδ. If the Rashomon set is large, e.g. it contains a ball of size at least δ, that is, R^set(F2,θ)Bδ(), then there exists a model f1¯R^set(F2,θ), such that for a fixed parameter ϵ(0,1):

  1. 1.

    f1¯ is from simpler class 1.

  2. 2.

    With probability greater than 1-ϵ w.r.t. the random draw of training data,

    |L(f1¯)-L^(f1¯)|2KRn(1)+blog(2/ϵ)2n,

    where Rn() is the standard Rademacher complexity of a function class .

As we have made approximating set arguments several times, with the most recent theorem (Theorem 4.5) making an assumption that all models from 2’s Rashomon set are close to a model in 1, let us discuss this assumption. The field of Approximation Theory provides general conditions under which classes of functions can approximate each other. Given a target function from one class, we want to know whether a sequence of functions from another class can converge to the target. Table 3 shows classes of functions 2 that can be approximated with functions from classes 1 within δ using a specified norm. For instance, piecewise constant functions, such as decision trees, can be approximated by smooth functions. In general it is much easier to optimize over classes of smooth functions than piecewise constant functions.

2 1 δ Source
fL(Ω), f[m,M] sNS(Ω), N#constants,
sN – piecewise constant
f-sNM-m2N [15, 14]
fWp1(Ω), 1p,
where Wp1 is a Sobolev space
sΔ(f)S(Ω), N - #constants,
sΔ – piecewise constant,
Δ – fixed partition Ω=(0,1)d
f-sΔ(f)pCN-1/d|f|Wp1Ω [14]
f{xk, kN} P(n) – polynomials of degree
at most nN
f-P(n)12k-1j>(n+k)/2(kj) [34]
fC[0,1] is a non-constant symmetric boolean function
on x1,..,xn
P(d) – algebraic polynomials
of degree d
f-P(d)𝒪(n(n-Γ(f))) [35]
fLipM(α), f is Lipschitz continuous with constant M Nn:[a,b] is a feedforward neural network with one layer and bounded, monotone and odd defined activation function,
n
supx[a,b]|f(x)-Nn(x)| 5M2(b-an)α [9]
fLp(I), where Id is a cube in d, Wr(Lp(I)) – Sobolev semi norm Pr – space of polynomials of
order r in d, constant C depends on r
infpPrf-pLp(I)C|I|r/d|f|Wr(Lp(I)) [15]
Table 3: Examples of function approximation in different functional classes: a function from class 1 approximates a function in class 2 with given guarantee δ.

The previous theorems showed the existence of a single function from 1 with desirable properties. Ideally, we would want multiple functions in 1 with these properties, since in practice, when we search 1, we may not find the single function guaranteed by the previous theorem. Theorem 4.6 below guarantees the existence of more such functions.

For a hypothesis space , define as a ϵ-packing a finite set Ξ={ξ1,,ξk|ξi} such that ξi-ξjp>δ and Bδ/2(ξi)Bδ/2(ξj)= for all ij. The packing number (,δ) is the largest δ-packing. Then we have the following:

Theorem 4.6 (Existence of multiple simpler models).

For K-Lipschitz loss l bounded by b, consider hypothesis spaces F1 and F2 such that F1F2 and for every model f2R^set(F2,θ) there exists a model f1F1 such that f2-f1pδ. Then there exists at least B=B(R^set(F2,θ),2δ) functions f11¯,f12¯,f1B¯R^set(F,θ) such that:

  1. 1.

    They are from a simpler class: f11¯,f12¯,f1B¯1.

  2. 2.

    With probability greater than 1-ϵ w.r.t. the random draw of training data, |L(f1i¯)-L^(f1i¯)|2KRn(1)+blog(2/ϵ)2n, for all i[1,..,B], where Rn() is the Rademacher complexity of a function class .

From Theorem 4.6 we see that larger Rashomon sets have larger packing numbers, and therefore contain more simpler models with good generalization guarantees.

Note that in Theorems 4.4, 4.5 4.6 other complexity measures from the learning theory could be used, such as VC dimension or fat-shattering dimension, to bound the generalization of the hypothesis space 1. We chose the Rademacher complexity as it provides the tightest bound among standard measures.

As mentioned briefly earlier, Theorem 4.6 could have implications in practice, because if our data and algorithm admit a large Rashomon set on a complex class, Theorem 4.6 suggests that it could be beneficial to locate models from simpler classes within the Rashomon set. These simpler models could be, for instance, models that are constrained to be interpretable. Finding interpretable models can often be computationally demanding, since this generally involves minimizing training loss subject to interpretability constraints, which are often discrete or challenging in other ways. The existence of a large Rashomon set on a more complex class of functions implies that there exist possible many solutions to the constrained optimization problem over the simpler class, and thus it is worthwhile to actually solve this optimization problem over the simpler class. In other words, if the Rashomon set is large, and the other conditions are obeyed, Theorem 4.6 acts as a type of proof that many interpretable-yet-accurate models would exist, prior to actually finding them.

4.3 Generalization bounds for all models from the Rashomon set

By using 1 as an approximating set for 2 under smoothness assumptions, we can show that not only do 1’s functions generalize, but also the entire set of functions within 2’s Rashomon set generalize. The generalization guarantee uses the complexity measure of the simpler class 1, rather than that of the more complex class 2.

The importance of this result, stated formally in Theorem 4.7, is the implication that good approximating sets mean that the learning problem is inherently lower complexity than a naïve learning theory analysis (which uses 2’s complexity measure) might suggest. This bound is basic and is a distilled version of standard analyses that use approximating sets.

Theorem 4.7.

[Generalization and reduced complexity of the Rashomon set of F2] For a K-Lipschitz loss l bounded by b consider two hypothesis spaces F1F2 such that for any model f2R^set(F2,θ) there exists a model f1F1 such that f2-f1pδ, then for all f2R^set(F2,θ) and for any ϵ>0 with probability at least 1-ϵ:

|L(f2)-L^(f2)|2K(δ+Rn(1))+blog(2/ϵ)2n,

where Rn(F) is the standard Rademacher complexity of a function class F.

Theorem 4.7 shows that if 1 is a good approximation set for the Rashomon set of 2, then we can obtain a generalization guarantee for any function from the Rashomon set of the complex class 2 using only the complexity of the simpler class 1. As a reminder, Table 3 shows examples of function classes where good approximating sets occur. The approximating set is better when δ is small, and this is when the generalization bound is tighter.

The next bound is a basic covering number bound that uses the size of the Rashomon set to trade off an upper bound on the generalization error with the probability that the bound holds. If the Rashomon set is large, the probability that the bound holds becomes better. The right side of the bound is approximately 5θ, where θ is the Rashomon parameter.

For a hypothesis space , the δ-cover is a finite set H={ihi,hi} such that for any f there exists hH such that f-hpδ. In other words, the hypothesis space is contained in the union of a finite set of δ-balls hHBδ(h). The covering number 𝒩(,δ) is the smallest number of δ-balls required to cover the whole space : 𝒩(,δ)=inf{|H|:H is a δ-cover}. In the bound below, we assume that the Rashomon set is covered by one ball of size δθ centered at f^ (this does not need to be a ball in a minimal cover), and we use a minimal cover over the hypothesis space with the same ball size δθ as this ball that covers the Rashomon set.

Theorem 4.8.

[Occham’s razor Rashomon bound] For a K-Lipschitz loss l bounded by b for all fR^set(F,θ) we have that with probability at least 1-2N(F,δθ)e-2n(θ/b)2 with respect to the random draw of data:

|L(f)-L^(f)|4Kδθ+θ,

where K is a Lipschitz constant and δθ+ is a radius of the Rashomon set, given by the Rashomon assumption:

δθ=maxfR^set(,θ)f-f^p, (3)

where recall that f^argminfFL^(f).

As the Rashomon set grows, the covering number of the hypothesis space decreases, and the probability with which the bound holds increases. Thus, if the Lipschitz constant K is on the order of θ/δ, then it is beneficial to have a larger Rashomon set, as it leads to a smaller covering number, and a higher probability. The generalization guarantee of approximately 5θ means that the test error of functions in the Rashomon set is guaranteed not to be too much more than the observed difference in error between functions in the Rashomon set during training.

Typically, chaining is used to tighten covering number bounds by integrating over all possible ball sizes, but in the bound above, we have chosen a single δθ to allow the interpretation provided above.

A large Rashomon set is thus a certificate of better generalization, because it allows us to find models from a simpler class, and/or use complexity measures of a simpler class. Despite the connection of the Rashomon set to generalization, the size of the Rashomon set is not the same as the typical complexity measures of function classes. Section 6 relates the size of the Rashomon set to several established complexity measures. Before that, in Section 5, we discuss computation.

5 Computation of the Rashomon ratio and volume

Often in practice, we do not need to compute the Rashomon ratio, as there is a practical way to check if the Rashomon ratio could be large as we discuss in Section 7. However there are cases when we actually can derive a closed-form solution for the Rashomon volume or estimate the Rashomon ratio using sampling techniques. We discuss these methods of the Rashomon ratio computation in this section.

The Rashomon ratio can be viewed as a probability that a model is contained within the Rashomon set, taking a uniform prior over the hypothesis space. In Equation (1) we compute this probability by comparing the volume of the Rashomon set to that of the full hypothesis space. We can compute this probability directly by sampling models from the hypothesis space, and taking the fraction of times the sample lies inside the Rashomon set. By Hoeffding’s inequality, this rejection sampling procedure has probabilistic guarantees and can be used in cases when the hypothesis space is discrete, e.g., tree structures. We discuss this in Section 5.1.

When the hypothesis space has a parameterized representation, we can compute the Rashomon volume in parameter space. We assume that we can parameterize each model f in a hypothesis space with a unique, finite number of parameters and denote f(z)=fω(z), where ω is a parameter vector. For a parameter space Ω, the parametric hypothesis is denoted: Ω={fω(z):ωΩp}. To be consistent with how the Rashomon set is defined on hypothesis spaces (using similarity directly between functions, rather than similarity between parameters), we will assume that there exists a parametrization such that distances according to the provided metric are similar in both the hypothesis and the parameter space. Otherwise, due to overparametrization or reparametrization, the Rashomon volume can be artificially changed [17]. This problem can be avoided by the choice of parameterization that encourage distances in parameter space to be similar to distances in hypothesis space. In Sections 5.2, we show how to compute the Rashomon volume in parameter space directly in closed form for ridge and least squares regression.

5.1 Sampling methods

As before, assume that there exists a prior distribution ρ over the hypothesis space . From the definition, the Rashomon ratio can be computed as a probability of a model being in the Rashomon set:

R^ratio(,θ)=P[fR^set(,θ)]=𝔼fρ𝟙[fR^set(,θ)].

To approximate the Rashomon ratio, we perform rejection sampling with replacement. In particular, after k draws from distribution ρ,

P^[fR^set(,θ)]=1ki=1k𝟙[fR^set(,θ)].

By Hoeffding’s inequality: P(|P^[fR^set(,θ)]-P[fR^set(,θ)]|t)2e-2kt2, or alternatively 1-α=P(|P^[fR^set(,θ)]-P[fR^set(,θ)]|<t)1-2e-2kt2. Then, in order to estimate the Rashomon ratio P[fR^set(,θ)] to within t, with a (1-α) confidence interval, we need to sample at least klog2/α2t2 hypotheses from . The guarantees from this rejection sampling approach are tight enough to be used in practice, and can be used in most hypothesis spaces where hypotheses can be randomly generated.

There are cases when the Rashomon set is very small and therefore rejection sampling contributes very little to the Rashomon ratio approximation. It makes sense to draw models from the region around the Rashomon set instead of from the set of all reasonable models. Importance sampling allows us to sample from an alternative distribution, called the proposal distribution, that is concentrated on the importance region. However, after sampling is done, we need to adjust the weight of the sample to match the probability of sampling it from the original distribution. Let ρ be a target distribution over the hypothesis class and q be a proposal distribution that is focused around the Rashomon set. We can estimate the Rashomon ratio through importance sampling as follows:

R^ratio(,θ)=𝔼fqρ(f)q(f)×𝟙[fR^set(,θ)].

In Section 8.1 we discuss how to design a target distribution for the decision tree hypothesis space based on the model’s structure and labels of the dataset.

5.2 Analytical calculation of Rashomon volume for ridge regression

A special case of when the Rashomon volume can be computed in closed form in a parameter space is ridge regression. For a space of linear models Ω={ωTx,ωp}, ridge regression chooses a parameter vector by minimizing the penalized sum of squared errors for a training dataset S=[X,Y]:

minωL^(ω)=minω(Xω-Y)T(Xω-Y)+CωTω, (4)

where the optimal solution of the ridge regression estimator is ω^=(XTX+CIp)-1XTY.

Geometrically, the optimal solution to ridge regression will be a parameter vector that corresponds to the intersection of ellipsoidal isosurfaces of the sum of squares term and a hypersphere centered at the origin, with the regularization parameter C determining the trade off between the loss and the radius of the sphere. More generally, isosurfaces of the ridge regression loss function are ellipsoids, and the volume of such an ellipsoid corresponds to the Rashomon volume. Using this geometric intuition, we compute the Rashomon volume in parameter space by the following theorem:

Theorem 5.1.

For a parametric hypothesis space of linear models FΩ={fω(x)=ωTx,ωp} and a dataset S=X×Y, the Rashomon set R^set(FΩ,θ) of ridge regression is an ellipsoid, containing vectors ω such that:

(ω-ω^)TXTX+CIpθ(ω-ω^)1,

and the Rashomon volume can be computed as:

𝒱(R^set(Ω,θ))=J(θ,p)i=1p1σi2+C, (5)

where σi are singular values of matrix X, J(θ,p)=πp/2θp/2Γ(p/2+1) and Γ() is the gamma function.

Note that for least squares regression, we can use results of Theorem 5.1 with penalization constant C=0. When features in matrix X are linearly dependent, some singular values will be 0. In this case the volume of the Rashomon set based on Equation 5 goes to infinity. We avoid this problem by making sure that the Gram matrix of the feature matrix X is always positive definite, meaning that all singular values are non-zero. One way to ensure this is to perform principal component analysis and use the most significant components as replacement features.

Interestingly, from Theorem 5.1, it follows that for ridge regression, the Rashomon volume depends on the feature space only and does not depend on the regression targets Y. Indeed, assume that every parameter vector ω such that fωR^set(Ω,θ) can be represented as ω=ω^+δ. By a simple transformation we have that L^(fω)-L^(fω^)=δTXTXδ, meaning that if we take a step in parameter space, the empirical risk difference will depend only on the feature space and the step itself, and not on the targets of the problem. This observation can help us choose the parameter θ as θ=δTXTXδ if we want to ensure some dependence between the optimal model ω^ and a model of interest ω. Then, by choosing the direction as δ=ω-ω^ we can compute the Rashomon parameter.

For other algorithms, the Rashomon volume generally depends on the targets; in that sense, ridge regression is unusual.

The cases we considered in this section restrict the structure or properties of the learning problem, but these restrictions allow us to compute the Rashomon volume directly, or estimate the Rashomon ratio with high probability. In Appendix E we further discuss how to approximate the Rashomon volume to any pre-specified precision when the Rashomon set is convex. We also provide an optimization problem to under-approximate the Rashomon volume in the parameter space for the support vector machine classification problem.

We will use both sampling techniques and the ridge regression closed-form solution for the Rashomon experiments in later sections.

Over the decades, the statistical learning theory community developed beautiful measures that show the expressive power and richness of hypothesis spaces, and how they relate to data and algorithms. The most popular are VC dimension, algorithmic stability, geometric margins, and Rademacher complexity. The Rashomon ratio is different from all of these well-known complexity measures: we can find cases where there is no correspondence between them. In Section 6, we illustrate that there exist datasets and distributions that illuminate differences between the Rashomon ratio and the standard complexity measures.

6 Rashomon ratio as compared to simplicity measures from learning theory

The Rashomon ratio enables simplicity of a learning problem, but it is different from well-known complexity measures from learning theory. The Rashomon ratio depends on a loss function, the hypothesis class, and a dataset, while the majority of other measures are either dataset-agnostic or focus on properties of a specific model in the class. We will compare the Rashomon ratio to different quantities that are used for generalization in statistical learning theory, including VC dimension [45], the stability of a learning algorithm [5], geometric margins [38, 7], Rademacher complexity and local Rademacher complexity [3], because the Rashomon ratio is both similar and different to each of these measures. We will use demonstrations to show the differences.

VC dimension. Vapnik-Chervonenkis (VC) dimension [45] and the Rashomon set are completely different concepts in terms of data-dependence. The VC dimension is the cardinality of the largest set of points that the learning algorithm can shatter. The hypothesis space shatters a set of points if it can achieve any possible target labeling on this set. In other words, the VC dimension shows the expressive power of a hypothesis space for any dataset including an extreme arrangement of data points and labels. On the contrary, the Rashomon set depends on an empirical risk minimizer that we compute directly for a specific data set, which may not be extreme.

Algorithmic stability. The main motivation for algorithmic stability theory is to ensure robustness of a learning algorithm. Following Bousquet and Elisseeff [5], we define the hypothesis stability of a learning algorithm as follows.

Definition 6.1.

A learning algorithm 𝒜 has β hypothesis stability with respect to the loss l if for all i{1,,n},

𝔼S,z[|l(fS,z)-l(fSi,z)|]β,

where βR+, hypothesis fS is learned by an algorithm 𝒜 on a dataset S, loss l(fS,z)=ϕ(fS(x),y) for z=(x,y), dataset S={z1,zn}, and Si is modified from the training data by removing the ith element of the dataset: Si={z1,,zi-1,zi+1,zn}.

In Section 5.2 we showed that, in case of linear least squares regression, the Rashomon volume depends on features X only and does not depend on regression targets Y. In contrast, hypothesis stability depends heavily on Y. In fact, if we can control how we change the set of targets, hypothesis stability can be made to change by an arbitrarily large amount – the Rashomon volume is fundamentally different from hypothesis stability. This is formalized in Theorem 6.1.

Theorem 6.1.

Consider a distribution PX over a discrete domain X={x1,xN} and a learning algorithm A that minimizes ridge regression’s empirical risk L^ for a linear hypothesis space FΩ, as in (4). For any λ>0 there exist joint distributions PX,Y1 and PX,Y2 where for X drawn i.i.d. from PX, Y1 is drawn from PY1|X over Y|X and Y2 is drawn from PY2|X over Y|X, such that the expected Rashomon ratios are the same:

𝔼PX,Y1[Rratio𝐒1(Ω,θ)]=𝔼PX,Y2[Rratio𝐒2(Ω,θ)],

yet hypothesis stability constants are different by an arbitrarily chosen value of λ:

β~2-β~1λ,

where S1 and S2 denote datasets S1=[X,Y1] and S2=[X,Y2], β~1 is the hypothesis stability coefficient of algorithm A for distribution PX,Y1 and β~2 is the hypothesis stability coefficient for distribution PX,Y2.

Geometric margin. Intuitively both the Rashomon ratio and the width of the geometric margin are data-dependent and show how expressive the hypothesis space is with respect to a given dataset. However, the margin depends on the closest data points to the decision boundary (e.g. support vectors), while the Rashomon set does not necessarily rely on the support vectors and may depend on the full dataset. Theorem 6.2 summarizes this idea.

Before stating the theorem, we provide a definition of the margin. For the parametric hypothesis space of linear models Ω=f(x)=ωTx,ωp and binary classification, denote d+ and d- as the shortest distances from a decision boundary to the closest points with targets y=1 and y=-1 respectively. Then the margin d is a sum of these distances d=d++d- [7]. Moreover, for the model fω^ that maximizes the margin, the margin width is 2ω^2.

Theorem 6.2.

For any fixed 0<λ<1, there exists a fixed hypothesis class FΩ, a Rashomon parameter θ, and there exist two datasets S1 and S2 with the same empirical risk minimizer f^FΩ such that: the width of the geometric margin d is the same for both datasets, yet the Rashomon ratios are different:

|RratioS1(Ω,θ)-RratioS2(Ω,θ)|>λ.

Empirical local Rademacher complexity. The empirical Rademacher complexity is another complexity measure of the hypothesis space. As in [3], for binary classification we define it as follows.

Definition 6.2.

Given a dataset S, and a hypothesis class of real-valued functions, the empirical Rademacher complexity of is defined as:

R^nS()=1n𝔼σ[supfFi=1nσif(zi)],

where σ1,σ2,,σn are independent random variables drawn from the Rademacher distribution i.e. P(σi=+1)=P(σi=-1)=1/2 for i=1,2,,n.

Since we are interested only in models that are inside the Rashomon set, in this section we will consider local empirical Rademacher complexity [3], which is defined using the Rashomon set R^set(,θ). Empirical Rademacher complexity measures how well the hypothesis space can fit to random assignments of the labels. In contrast, the Rashomon volume is different, as it measures the number of models that are close to optimal. In other words, the Rashomon set benefits from having multiple similar models, while Rademacher complexity treats them as equivalent. In the following theorem, we provide a simple example to show this discrepancy between the two measures.

Theorem 6.3.

For 0<λ<1, there exist two datasets S1 and S2, a hypothesis class FΩ, and a Rashomon parameter θ such that the local Rademacher complexities defined on the Rashomon sets for S1 and S2 are the same:

R^nS1(R^set(Ω,θ))=R^nS2(R^set(Ω,θ)),

yet the Rashomon ratios are different:

|RratioS1(Ω,θ)-RratioS2(Ω,θ)|>λ.

Pattern Rashomon ratio. The pattern Rashomon ratio is different from both the Rademacher complexity and geometric margins. Intuitively the pattern Rashomon ratio is closer to the Rademacher complexity, as it tries to find the number of models that fit the best under different label permutations; in contrast, the standard multiplicity-based Rashomon ratio is closer to geometric margins (the multiplicity-based Rashomon ratio tends to be larger when the classification margins are larger).

Additionally, there is a straightforward connection between the growth function and the pattern Rashomon ratio. Recall that a growth function, or a shattering coefficient, is the maximum number of ways any n data points can be classified using the hypothesis space. The connection is that the volume of the hypothesis space measured using pattern distance is exactly the growth function defined on the current dataset To clarify, the pattern Rashomon ratio and the growth function are equivalent under very specific conditions: (i) the Rashomon set is the full hypothesis space (this is unlikely in practice), (ii) we consider classification with 0-1 loss as the performance measure ζ, and (iii) we consider only one dataset and do not take expectation over all datasets (as is usual for the growth function).

Now that we have established what we expect to see theoretically from the Rashomon ratio, let us move to experiments.

7 Larger Rashomon ratios correlate with similar performance of machine learning algorithms, and good generalization

In this section, we present several observed properties of larger Rashomon ratios.

We would like to measure the Rashomon ratio of a hypothesis space that includes a broad variety of functions, including some that are capable of fitting the data approximately as well as boosted decision trees, support vector machines with gaussian kernels, or other complex machine learning algorithms. Since it is not clear how to perform direct sampling of this class in function space, we chose to approximate this space by a function class that (1) we could sample from, and (2) would potentially be a good approximating set for the desired hypothesis space. The class we chose was decision trees of depth 7, which is flexible and large, yet easy to sample from. Since decision trees can refine an input space arbitrarily finely as their depth increases, we can view sufficiently deep decision trees as a rich hypothesis space that approximates many other types of hypothesis spaces, including those used by other machine learning methods. Thus, it is conceivable that large Rashomon sets for decision trees translate into large Rashomon sets for the hypothesis space we would like to consider in function space.

To estimate the Rashomon ratio of depth 7 decision trees, we used importance sampling, as discussed in Section 5.1. We used two different types of importance sampling: leaf and feature based. For the leaf based approach, the proposal distribution assigns the correct labels to the leafs of the tree based on the training data. Since the data are populated on a bounded domain, to grow a tree up to a depth D fully, we make 2D-1 splits. For the feature based importance sampling with probability ϵ we choose some of these splits uniformly at random and with probability 1-ϵ according to Gini index. We set ϵ=0.5 in our experiments. For each dataset and each depth, we sampled from 250,000 to 500,000 decision trees depending on a dataset complexity. We choose the Rashomon parameter θ to be 5%, and, therefore, all the models in the Rashomon set have empirical risk not more than L^(f^)+0.05, where L^(f^) is the lowest achievable empirical risk across all algorithms we considered. (We vary the θ value later; the results did not seem to be sensitive to that choice.)

Separately, we assess whether many different machine algorithms perform similarly on the dataset. If many different algorithms with different functional forms and different levels of smoothness perform similarly on a dataset, the Rashomon set contains all of these diverse functions. In that case, we conjecture that the Rashomon set could be large. In this first experiment, we test this conjecture, by investigating whether large Rashomon sets (as measured with decision trees of depth 7) correlate with many machine learning methods performing similarly on the same dataset.

Further, we test whether models coming from a large Rashomon set tend to generalize well, as Theorem 4.8 suggests that they might.

Our experiments considered five popular machine learning algorithms: logistic regression, CART, random forests, gradient boosted trees, and support vector machines with radial basis function kernels. We use 36 machine learning classification datasets from the UCI Machine Learning Repository [16], among which 15 have categorical features and 21 have real-valued features. The majority of the datasets are binary classification datasets although we also considered datasets with three, four and six classes. The number of features varies from 3 to 784, with the majority of the datasets being in the 15–25 feature range. Appendix G contains a description of the datasets we considered.

To recap, we used decision trees of depth 7 to estimate the Rashomon ratio directly, and separately use a variety of machine learning methods on the same data to assess whether large measured Rashomon ratios correlate with similar training performance across algorithms, as well as good generalization between training and test sets.

7.1 Similar Performance Across Algorithms

Figure 5(a) illustrates the performance of the five machine learning algorithms with the largest Rashomon ratios in the space of decision trees of depth seven. Across all of these cases, larger Rashomon ratios led to approximately similar training results (within 5% difference between algorithms). Moreover, all of the models chosen by the algorithms generalized well and produce very similar test accuracy.

Interestingly, the converse statement, that similar performance across different algorithms should lead to large Rashomon sets, does not always hold; sometimes, generalization occurs with small Rashomon ratios. This observation could be explained in several different ways. First, the Rashomon ratio may not be the only driver of good generalization performance. Second, even when the Rashomon ratio is a good driver of generalization performance, it may appear artificially small because of a poor representation of data or poor choice of the ratio’s denominator. For instance, if the features are highly correlated, this artificially deflates the size of the Rashomon ratio, as discussed in Appendix H. Moreover, if the denominator of the Rashomon ratio (which is the size of the hypothesis space) is poorly designed to include an overly large number of models, then the Rashomon ratio may appear artificially small.

The issues with measuring the Rashomon ratio may be a possible explanation for the results in Figure 5(b), which shows datasets with high-performing algorithms, yet (by the way we measured it) a small Rashomon ratio.

7.2 Good Generalization

In all cases we observed, if training performance was consistent across algorithms, test performance was also similar. One thing we notably did not observe were cases where algorithms did not generalize, performance differed across algorithms, and the Rashomon set was large. Actually, we did not observe cases where the Rashomon set was large and performance differed among algorithms. All of these observations are consistent with (but do not definitively prove) the theory that consistent performance across algorithms occur because of large Rashomon sets, which in turn leads to generalization.

Figure 5(c) shows small Rashomon sets and wildly different performance across algorithms, and in that case, sometimes the models generalize and sometimes they do not. We show one example of each of these cases in Figure 5(c). Our theory does not apply to the case of small Rashomon sets, but again the appearance of small Rashomon sets could be due to poor ways of measuring the size of the Rashomon sets in our experiments.

Figure 2:
Figure 3:
Figure 4:
Figure 5: (a) Examples of experiments on four datasets showing that larger Rashomon ratios lead to similar performance of five machine learning algorithms. All the algorithms generalize well and have similar test accuracy. (b)-(c): Examples showing that smaller Rashomon ratios do not imply a performance difference between machine learning algorithms. Even with low Rashomon ratios, algorithms can be highly accurate and generalize well, as shown in Figure (b). On the other hand, when the Rashomon ratio is small, sometimes algorithms can perform differently or fail to generalize, as shown in Figure (c).

There are several conclusions we can make from our experiments. The most important conclusion is that when the Rashomon ratio is observed to be large, all algorithms perform similarly, and their models tend to generalize. Given these conclusions, one may ask whether it is desirable to aim for the largest possible Rashomon ratio in all cases. In the next section, we introduce Rashomon curves and address this question empirically.

8 Rashomon curves

In this section we introduce a trend, namely the Rashomon curve, that we observe experimentally across all classification datasets we downloaded from the UCI Machine Learning Repository [16].

Consider a hierarchy of hypothesis spaces H0H1HT, where each Ht={h:𝒳𝒴}, and 𝒳=[0,1]p is a unit hypercube. We consider the empirical risk over the loss function as follows L^(H)=minhH1ni=1nϕ(h(xi),yi). For a fixed θ, we choose the parameter θt for each hypothesis space Ht so that ,R^set(Ht,θt) would contain models with empirical risk not more than T^(HT)+θ. Therefore, θt:=max(0,θ-(L^(Ht)-L^(HT))).

The Rashomon curve is a function from the empirical risk to the log of the Rashomon ratio for a hierarchy of hypothesis spaces. More formally, for a hierarchy of hypothesis spaces H0,HT, the Rashomon curve is obtained by connecting the following points: (L^(H0),R^ratio(H0,θ0)),, (L^(Ht),R^ratio(Ht,θt)),, (L^(HT),R^ratio(HT,θT)).

8.1 Rashomon curves tend to exist often, and are not obvious

For a hierarchy of hypothesis spaces, the Rashomon curve shows that as the size of the hypothesis space grows, the empirical risk of classifiers within the space first decreases and then the Rashomon ratio decreases. As a result, the Rashomon curve has a Γ-shaped trend as in Figure 10 (a)-(c).

(a)
  
(b)
  
(c)
  
(d)
Figure 6:
Figure 7:
Figure 8:
Figure 9:
Figure 10: (a)-(d) Illustrations of the Rashomon curve’s general shape. For a hierarchy of hypothesis spaces, each space is represented with a colored dot, where different colors corresponds to different hypothesis spaces. As the size of the hypothesis spaces grow, first the empirical risk decreases (horizontal trend) and then the Rashomon ratio decreases (vertical trend). We empirically observed either the full or partial Rashomon curve in every dataset we considered. (e)-(h): Rashomon curves for the UCI classification datasets with categorical features. The hierarchy of hypothesis spaces is the set of decision trees from depth one to seven. We display only hypothesis classes that have nonzero measured Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments.

The horizontal part of the Γ-shape corresponds to a decrease in the empirical risk as we move through the hierarchy of hypothesis spaces. If the hypothesis space with the largest size we consider does not achieve a low value of the empirical risk (e.g., in a case of a complex learning problem) we will observe only the horizontal part of the Rashomon curve. This pattern indicates that none of the hypothesis spaces considered are complex enough to learn the training data well. What we call “the horizontal part” of the Γ curve is not always strictly horizontal, it can be sloped as in Figure 10 (c). The slope on this axis is actually just caused by using a fixed threshold θ for the Rashomon set; for simpler hypothesis spaces, few to no models are in the hypothesis space, which means the Rashomon set is actually very small. As we move through more complex hypothesis spaces, the number of models in the Rashomon set grows (along with the ratio), but then the number of models in the hypothesis space explodes, leading to extremely small Rashomon ratios.

The vertical part of the Rashomon curve corresponds to changes in the Rashomon ratio. When a learning problem is easy (e.g., separable datasets, datasets with large margin, datasets with only one or two relevant features), a high accuracy on the training data is easily achievable with a smaller hypothesis space. In that case, we will observe only the vertical part of the Rashomon curve as illustrated in Figure 10 (d). The vertical part of the curve corresponds to more complex hypothesis spaces, which is where overfitting can occur. However, the steep drop in Rashomon ratio tends to overwhelm the increases in training accuracy that correspond to overfitting.

The existence of Rashomon curves is not obvious. We could not replace the Rashomon ratio on the vertical axis of the curve with another complexity measure and expect to see the same curve. In fact, a plot of any non-dataset-driven complexity measure (VC dimension, Rademacher complexity, covering number) versus accuracy would never yield a Rashomon curve. This is because such a measure always increases with complexity – it could never yield a flat horizontal trend, as we observe with the Rashomon curve.

As before, for our experiments we used 36 UCI datasets from the UCI Machine Learning Repository. For the hierarchy of hypothesis spaces, we chose fully-grown decision trees up to depth D, where D[1,7]. we denote each space by DT-D. We computed the Rashomon ratios by importance sampling of the decision trees for each depth D. We choose the Rashomon parameter θ to be 5%, and all the models in the Rashomon sets of the same dataset have the empirical risk not more than L^(f^)+0.05, where L^(f^) is the lowest achievable empirical risk of the most complicated class of models in our hierarchy (which is DT-7).

Figure 10 (e)-(h) show the Rashomon curves for categorical and real-valued datasets. All of the Rashomon curves follow the same trends as in Figure 10 (a)-(d). Most of the curves have a full Γ-shape pattern, while some (e.g. Nursery-1, Monks-2) follow the vertical trend of the Rashomon curve only as in Figure 10 (d). As discussed earlier, this trend indicates a form of simplicity of the dataset, and, indeed, Nursery-1 is separable, and others can even be separated with a single decision stump.

As we change the value of the Rashomon parameter θ, the general shape of Figure 10 is preserved across all datasets. Figure 12 shows the Rashomon curves for some of the datasets with various values of the Rashomon parameter θ. As we decrease the value of θ, the Rashomon ratio decreases, the corner of the Γ-shape becomes steeper, and finally the curve’s shape changes to the horizontal trend.

Along a Rashomon curve, there is a point that balances between simplicity and empirical error, as illustrated in Figure 11(a). We call this point the Rashomon elbow and define it as follows:

Definition 8.1.

For a hierarchy of hypothesis spaces H0H1HT the Rashomon elbow is a hypothesis space He that both minimizes the empirical risk and maximizes the Rashomon ratio:

Heargmax{Ht:L^(h|[hHt])L^(h|[hHT])}R^ratio(Ht,θ). (6)

Models on the elbow should theoretically be both accurate and simple, and therefore generalize well. As we will discuss later, the elbow model is a good choice for model selection.

Note that when comparing Rashomon sets across a hierarchy, the Rashomon ratios decrease, but the sizes of the Rashomon sets increase, because these Rashomon sets are contained within each other. The Rashomon set for decision trees of size 5 is contained within the Rashomon set of size 6, even though the Rashomon ratio at size 5 is smaller than that of size 6; the denominator of the ratio increases faster than the numerator across the hierarchy.

To summarize, the Rashomon curve seems to be a fairly universal trend across datasets, which is that as the size of the hypothesis space increases, empirical risk is decreased until the elbow is reached, after which point, the risk stays approximately constant and the Rashomon ratio rapidly decreases. We will study the curve, its elbow, and their implication to generalization in the next section.

8.2 Rashomon elbow and empirical generalization properties

The Rashomon elbow as illustrated in Figure 11(a) is a balancing point between low-error empirical performance and large Rashomon ratio. Let us now consider the generalization error for each of the hypothesis spaces that are represented as arrows on Figure 11(a). Notice that the generalization error on the elbow is the lowest among all larger-sized hypothesis spaces achieving high accuracy.

Figure 11(a) is an idealized curve that abstractly represents a trend that we observed largely, but not universally, in the data: see the Rashomon elbow generalization error for the UCI-datasets we considered in Figure 11(b-d). From the figures we can divide the datasets into three categories. For the first category (Figure 11(b)) the Rashomon elbow generalizes exactly as in Figure 11(a). We can see, for example, on the Survival dataset, that all models starting from DT-3 overfit. The second category (Figure 11(c)) shows approximately the same generalization error across all complexities of the hypothesis spaces, meaning that the elbow model is still a good choice because it yields simpler models and generalizes as well as the most complex models. The third category (Figure 11(d)) shows large generalization errors across all of the hypothesis spaces; again the elbow model is not worse than all other models considered. Thus, we seem to find that the elbow model selection criteria either helps, or has no effect, but never achieves worse performance than other possible choices.

(a)
(b)
(c)
(d)
Figure 11: (a): An illustration of the Rashomon elbow implication on generalization. For a hierarchy of hypothesis spaces, each hypothesis space is represented with a colored dot. The Rashomon elbow is shown with a blue arrow. Arrows point from the training empirical risk for each hypothesis space to the test risk, where the length of the arrow δ is the generalization error. The Rashomon elbow is illustrated to have good test performance, which is consistent with the interpretation following Theorem 4.8. (b)-(d): The generalization ability of the Rashomon elbow for selected UCI classification datasets. The hierarchy of hypothesis spaces are fully grown decision trees from depth one to seven. We display only hypothesis classes that have nonzero Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments. In (b) the illustrated curve agrees with theory, in (c) all models generalize, and in (d) no models generalize.
Figure 12: Different values of the Rashomon parameter θ produce the Γ-shape trend.

The Rashomon elbow determines a useful trade-off between generalization and estimation error. Figure 11 illustrates why the Rashomon elbow model might be a good choice for model selection – it often achieves the lowest test error, as discussed in Section 4. The elbow model’s class is the smallest (simplest) among hypothesis spaces that achieves low training empirical risk. As we showed empirically in Figure 12, the location of the Rashomon elbow is not particularly sensitive to the choice of θ. For a hierarchy of fully-grown decision trees DT-D, D[1,7] Figure 10 shows the Rashomon curves. The Rashomon elbow model tends to have both high test accuracy and high Rashomon ratio, e.g., DT-6 for Monks-2, DT-3 for Banknote, DT-1 for Nursery-2 data.

Possible ways to formulate the optimization problem to find the elbow are in Appendix K.

In cases when the Rashomon ratio is complicated to compute, we can approximately find the elbow model based on changes in the empirical risk as we vary complexity of the hypothesis space. In particular, consider a hierarchy of hypothesis spaces H0H1HT and corresponding empirical risks L^(H0),L^(H1),,L^(HT) for the best models in these classes. Starting with the most complicated hypothesis space HT and decreasing the size of the hypothesis space Ht, we stop when there is a jump in the empirical risk Ht^. The hypothesis space before this significant increase in the empirical risk, which we denote as He¯, has the smallest size among Ht,He¯-1 and thus is near the top of the vertical trend of the Rashomon curve. Moreover, He¯ has low empirical risk and, therefore, is approximately at the Rashomon elbow.

8.3 Rashomon curves for ridge regression

The Rashomon curve trend holds beyond classification problems, and here we show that it holds for ridge regression as well. In particular, Figure 13 shows the Rashomon curves for a hierarchy of polynomial hypothesis spaces, where the Rashomon volume was plotted against the empirical risk. For the Rashomon volume computation, we used results of Theorem 5.1, which allows an analytical computation of the volume. The formula depends on the feature matrix, the Rashomon parameter and the number of features in the dataset. For our experiments, we choose 13 real-valued regression datasets from the UCI repository and we choose three first principal components to form new features that are not redundant with other features. Then, to create a hierarchy of hypothesis spaces, we applied a polynomial transformation to these three features for polynomials of degree 2 through degree 8. We show the empirical risk in Figure 13 as well. As in the case of classification, we see that the Rashomon curve pattern holds, the elbow model exists, and it produces competitive generalization error comparing with higher dimensional spaces.

Figure 13: The generalization ability of the Rashomon elbow for the UCI regression datasets with real features. We consider 3 features for every dataset with the largest corresponding singular values. The hierarchy of hypothesis spaces consists of polynomials of degree from one to eight. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments. The regularization parameter for ridge regression was set to 0.1.

Although it was possible to compute Rashomon ratios analytically for ridge regression and to estimate them through sampling for decision trees, exhaustive computation of an entire Rashomon curve may not be a practical model selection technique in most cases. We discuss more practical implications in the next section.

9 Practical implications of the Rashomon sets and ratios

We begin by recalling the main conclusions from this paper that would be most impactful for practitioners:

  • Large Rashomon sets can embed models from simpler hypothesis spaces (Section 4).

  • Similar performance across different machine learning algorithms may correlate with large Rashomon sets (Section 7).

  • Large Rashomon sets correlate with good generalization performance (Section 7, 8).

  • The size of the Rashomon set is a measure of model class complexity that trades off with training loss to form Rashomon curves (Section 8).

How can a machine learning practitioner benefit from these insights? Let us consider a researcher conducting a standard set of machine learning experiments in which the performance of several different algorithms are compared, and generalization is assessed.

Although it may may not be desirable to compute an entire Rashomon curve explicitly, some commonly occurring scenarios can give an insight into where we are on the Rashomon curve. Consider the possible scenario where all algorithms perform similarly, and when their models tend to generalize well. Since we can assess generalization ability on validation data, we can determine directly whether models generalize. We can also determine whether all the models form a diverse set from knowing the model forms that each algorithm produces. For instance, a random forest model has a different form than a single decision tree: they are both piecewise constant but forests have many more pieces. Forests and trees have a different form than a support vector machine model with gaussian kernels, which is smooth. In the case we are discussing, where the models from the various algorithms are different, they perform equally to other models that are more or less complex than themselves from a range of powerful algorithms, and they all generalize well, what we have found is that there are a large number of different well-performing models. These functions can thus constitute different members of a large Rashomon set in an embedding space F2Fsvm,F𝑏𝑜𝑜𝑠𝑡𝑖𝑛𝑔,Fforest, etc. of reasonable models. Here, 2 has limited complexity, which permits generalization of the various members of the Rashomon set, as well as other models within 2.

If, indeed, 2 exists and has the properties we claim (limited complexity class, large Rashomon set, models achieving highest test performance achievable on that dataset), then several doors open. At that point, the researcher could:

Delve in: find specialized models with specific properties, such as interpretability. If the researcher is interested in interpretable models, they can search the large Rashomon set of 2 to locate simpler models within that set. Based on the result in Section 4, such simpler models are likely to exist in a large Rashomon set of not-to-complex models.

Look up: improve generalization without losing prediction accuracy. Since all algorithms perform similarly, the algorithms could be producing models along the vertical part of a Rashomon curve of hypothesis spaces. In that case, it is worth looking higher up the Rashomon curve for simpler models that maintain the same prediction accuracy. This moves the researcher upwards along the curve towards the Rashomon elbow.

In the converse case to the one considered above, the researcher’s algorithms perform differently from each other. Based on our experiments in Section 7, our theory bestows none of the advantages listed above in this setting. There are two possible reasons for this. First, the complexity of the hypothesis spaces considered by the researcher may not be adequate for the task. The researcher thus could:

Broaden the horizon: use a more complex model class or classes. If all the algorithms perform differently, the researcher could be choosing models along the horizontal trend of the Rashomon curve. In that case, the simpler models are losing accuracy over the more complex models. This suggests that there still is room to move towards the elbow, by selecting a more complex model class that achieves better performance yet does not overfit.

A second reason for non-uniformity in performance could be that the task might benefit from specialized hypothesis spaces provided by some machine learning algorithms. For example, convolutional neural networks are particularly well-suited to certain types of vision tasks, where they outperform many general purpose machine learning algorithms. We would not expect uniformity in performance across different algorithms for such tasks.

In the cases considered above, we have shown how an understanding of the Rashomon curve can influence decisions in most cases where a researcher is exploring a data set and iteratively selecting algorithms or hypothesis spaces. As with other fundamental concepts in machine learning, such as the bias variance tradeoff, an understanding of Rashomon ratios and curves can inform practice even if such quantities are not computed explicitly but are inferred indirectly.

A perspective on modern machine learning applications and Rashomon curves

Recall that the Rashomon set is defined by the interaction between the hypothesis space and the machine learning problem. Since algorithms and benchmark datasets change over time, it should not be viewed as something that is static, but that gives insight into the state of the art at a particular point in time. Perhaps the differing perspectives that researchers have on simplicity and generalization are based on what portion of the Rashomon curve their algorithms are exploring.

At one time, the MNIST data set was considered a challenging benchmark problem, though accuracies from many modern, general purpose machine learning algorithms are all close to 100%. This suggests that the field is on the vertical part of the Rashomon curve, with no advantage left to be gained from more complex or specialized algorithms. In this case, searching within the Rashomon set for models with other desirable properties, such as simplicity or computational efficiency may be desirable. On the other hand, the newer ImageNet challenge dataset has seen increasing performance with increasingly complex model classes in recent years. Perhaps this suggests movement along the horizontal portion of the Rashomon curve towards the elbow, where further gains may yet be obtained.

The above vision examples with relatively high accuracy are in contrast with criminal recidivism prediction, where many different machine learning algorithms have essentially identical performance on many different recidivism prediction problems [48, 41, 1], where performance is measured across the full ROC curve. Recidivism prediction exemplifies the case where our theories become relevant: many different algorithms perform similarly, and very sparse interpretable models are as accurate as more complicated models. Interestingly, complicated black box models are still used in the justice system [2], despite the fact that there is little evidence to support the need for this level of complexity (see, e.g., [37]).

There is evidence that some credit risk assessment problems, and some medical problems, such as stroke risk in atrial fibrillation patients (see., e.g., [47, 29]), diabetes prediction [36], and pneumonia risk and readmission prediction [10] are in a similar state. For these problems, fully interpretable models have been derived that are approximately as accurate as the most accurate (potentially most complicated) machine learning models; perhaps these problems are on the vertical part of the Rashomon curve. Credit risk scoring leads to particularly interesting questions of accuracy-complexity trade-offs: there are financial incentives to keep credit risk models proprietary, which incentivizes the use of potentially overly complex models. Yet, a recent credit risk data set released by the Fair Isaac Corporation (FICO) for the purpose of a data mining competition yielded fully-interpretable models whose accuracy was on par with neural networks and other more complex model classes [12].

Figure 14: A possible perspective on modern machine learning and Rashomon curves: at a given point in time, the state-of-the-art for different datasets, data types, and algorithmic performance may be viewed as a location on that problem’s Rashomon curve. The maximal accuracy for each problem is different, as well as the location and shape of the Rashomon elbow. This figure is just an illustration – the values on the axes are not precise.

Figure 14 illustrates the perspective illustrated in this section: if we view current performance as a possible point on a Rashomon curve, it can be useful in determining how to proceed forward with analysis, including whether to delve in, look up, or broaden the horizon.

Conclusion and Future Work

This work studies the Rashomon set, which is the set of almost-equally-accurate models. It also studies the Rashomon ratio, which is the fraction of models that are in the Rashomon set. A large Rashomon set serves as certificate of existence for simpler-yet-accurate models, for a given dataset and hypothesis space. Although similar to complexity or simplicity concepts in machine learning, these Rashomon sets and ratios have key differences that lead to new insights. In cases where many different algorithms have similar and good performance, we hypothesize that a large Rashomon ratio may be the cause.

We also introduce the Rashomon curve, which is a function of the empirical risk versus the Rashomon ratio. The Rashomon curve follows a characteristic Γ-shape, and has occurred across more than 35 datasets that we considered. The Rashomon curve reveals a behavior of a hierarchy of hypothesis spaces: The accuracy first increases, then stabilizes, after which the Rashomon ratio decreases. The Rashomon curve’s elbow model serves as a useful trade-off between performance and simplicity, and empirically tends to generalize well.

In some cases, it may be practical to compute Rashomon ratios analytically or through sampling methods, but it is not essential to compute these quantities to benefit from the insight they provide. Clues, such as the similar performance of many different algorithms, can give insight into where a practitioner is on the Rashomon curve, and these insights can inform subsequent actions such as enriching the hypothesis space, or searching for more convenient models within the space(s) already considered.

There is also room for further work on techniques for estimating the size of the Rashomon set or ratio, either by sampling or in closed form. We provided a closed form solution for ridge regression only, but closed form solutions may exist for other hypothesis spaces. Better methods of computing or estimating these quantities may facilitate through empirical studies for additional machine learning models, which could bolster the empirical observations made in this paper. There are some challenges in calculating the Rashomon ratio discussed in Section 7, specifically that the Rashomon ratio may appear artificially smaller than it actually is, because of poor parameterization or overparameterization. As we attempt to calculate Rashomon ratios for larger function spaces, we should remember that the Rashomon ratio should be measured in functional space – Rashomon ratios computed in parameter space may not always serve as a good substitute.

Given that large Rashomon sets have these interesting properties, it would be worth exploring methods that explicitly try to (re)shape the problem to induce large Rashomon sets. Although we are not aware of any work that has directly done this, there are some existing approaches that may be re-interpreted in this way. For example, one practical technique for producing more robust classifiers is to add noise or smoothing to the training data, e.g., applying a slight blur filter to image data before training. This can be seen as flattening the optimization landscape and potentially increasing the size of the Rashomon set. It is also possible that techniques which inject noise directly into parameter space [21] could be interpreted as as having a similar effect. The fact that injecting noise into the dataset and/or optimization potentially leads to larger Rashomon sets is a possible connection to differential privacy and other types of privacy-preserving computation. One challenge is to determine whether techniques that inject noise actually do increase the Rashomon ratio. Another future direction is to revisit older techniques like that of [21], which inject noise during optimization; theoretically, if they widen the Rashomon set, they may improve performance in practice.

The theoretical results presented herein are fairly basic and often rely upon quantities that are sometimes difficult to measure in practice. There is room for further theoretical development to establish tighter and more practical bounds that follow from large Rashomon sets or ratios. One possible direction that could strengthen the connection between the theory and the observed trends in the Rashomon curve could develop along the lines explored by Shawe-Taylor [39], which considers data-dependent hierarchies of hypothesis spaces.

The connection between Rashomon sets and interpretability of models occurs in two places. First, we provided theoretical conditions under which simpler, high performing models may exist when the Rashomon set is large. Second, we hypothesize that in cases where many different algorithms perform well, a large Rashomon set containing simpler or more interpretable models may be in play. Further experimental studies and theoretical development could strengthen and give further insight into these connections.

Kurosawa’s window into human nature showed how the same event can be seen through different eyes. Decades of research on learning theory have given a variety of perspectives, such as VC theory or Rademacher complexity, on the relationship between hypothesis spaces and data sets. We have proposed Rashomon sets and ratios as another perspective on this relationship, and we have provided initial theoretical and experimental results showing that this is a unique perspective that may help explain some phenomena observed in practice.

References

  • [1] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin (2018) Certifiably optimal rule lists for categorical data. Journal of Machine Learning Research 18, pp. 1–78. Cited by: §9.
  • [2] J. Angwin, J. Larson, S. Mattu, and L. Kirchner (2016-05) Machine bias. ProPublica. Note: Available from: Cited by: §9.
  • [3] P. L. Bartlett, O. Bousquet, S. Mendelson, et al. (2005) Local Rademacher complexities. The Annals of Statistics 33 (4), pp. 1497–1537. Cited by: §1, §2, §6, §6, §6.
  • [4] P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §D.6, §D.7, §D.8, §D.9.
  • [5] O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of machine learning research 2 (Mar), pp. 499–526. Cited by: §1, §1, §6, §6.
  • [6] L. Breiman et al. (2001) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical science 16 (3), pp. 199–231. Cited by: §1.
  • [7] C. J. Burges (1998) A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery 2 (2), pp. 121–167. Cited by: §E.3, §6, §6.
  • [8] X. Cai, F. Nie, H. Huang, and C. Ding (2011) Multi-class L2, 1-norm support vector machine. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 91–100. Cited by: §E.3.
  • [9] F. Cao, T. Xie, and Z. Xu (2008) The estimate for approximation error of neural networks: a constructive approach. Neurocomputing 71 (4-6), pp. 626–630. Cited by: Table 3.
  • [10] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sterm, and N. Elhadad (2015) Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In Proceedings of Knowledge Discovery in Databases (KDD), pp. 1721–1730. Cited by: §9.
  • [11] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina (2016) Entropy-SGD: biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838. Cited by: §2.
  • [12] C. Chen, K. Lin, C. Rudin, Y. Shaposhnik, S. Wang, and T. Wang (2018) An interpretable model with globally consistent explanations for credit risk. In Proceedings of NeurIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: the Impact of Fairness, Explainability, Accuracy, and Privacy, Cited by: §9.
  • [13] B. Coker, C. Rudin, and G. King (2018) A theory of statistical inference for ensuring the robustness of scientific results. arXiv preprint arXiv:1804.08646. Cited by: §2.
  • [14] O. Davydov (2011) Algorithms and error bounds for multivariate piecewise constant approximation. In Approximation Algorithms for Complex Systems, pp. 27–45. Cited by: Table 3.
  • [15] R. A. DeVore (1998) Nonlinear approximation. Acta numerica 7, pp. 51–150. Cited by: Table 3.
  • [16] D. Dheeru and E. Karra Taniskidou (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §1, §7, §8.
  • [17] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio (2017) Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933. Cited by: Figure 15, §2, §5.
  • [18] A. Fisher, C. Rudin, and F. Dominici (2018) Model class reliance: variable importance measures for any machine learning model class, from the ”Rashomon” perspective. arXiv preprint arXiv:1801.01489. Cited by: §2.
  • [19] D. Galvin (2014) Three tutorial lectures on entropy and counting. arXiv preprint arXiv:1406.7872. Cited by: Appendix C.
  • [20] M. Grötschel, L. Lovász, and A. Schrijver (2012) Geometric algorithms and combinatorial optimization. Vol. 2, Springer Science & Business Media. Cited by: §E.2.
  • [21] S. Hochreiter and J. Schmidhuber (1997) Flat minima. Neural Computation 9 (1), pp. 1–42. Cited by: §2, Conclusion and Future Work.
  • [22] S. M. Kakade, K. Sridharan, and A. Tewari (2009) On the complexity of linear prediction: risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800. Cited by: §2.
  • [23] R. Kannan, L. Lovász, and M. Simonovits (1997) Random walks and an O*(n5) volume algorithm for convex bodies. Random Structures & Algorithms 11 (1), pp. 1–50. Cited by: §E.2.
  • [24] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §2.
  • [25] V. Koltchinskii, D. Panchenko, et al. (2002) Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30 (1), pp. 1–50. Cited by: §2.
  • [26] A. Kurosawa (1950) Rashomon. Tokyo: Daiei. Cited by: §1.
  • [27] J. Langford and J. Shawe-Taylor (2003) PAC-bayes & margins. In Advances in neural information processing systems, pp. 439–446. Cited by: §2.
  • [28] G. Lecué (2011) Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. Ph.D. Thesis, Université Paris-Est. Cited by: §2.
  • [29] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan (2015) Interpretable classifiers using rules and bayesian analysis: building a better stroke prediction model. Annals of Applied Statistics 9 (3), pp. 1350–1371. Cited by: §9.
  • [30] G. Lugosi, A. B. Nobel, et al. (1999) Adaptive model selection using empirical complexities. The Annals of Statistics 27 (6), pp. 1830–1864. Cited by: §2.
  • [31] G. Lugosi, M. Wegkamp, et al. (2004) Complexity regularization via localized random penalties. The Annals of Statistics 32 (4), pp. 1679–1697. Cited by: §2.
  • [32] F. J. MacWilliams and N. J. A. Sloane (1977) The theory of error-correcting codes. Vol. 16, Elsevier. Cited by: Appendix C.
  • [33] S. Mendelson (2003) A few notes on statistical learning theory. In Advanced lectures on machine learning, pp. 1–40. Cited by: §2.
  • [34] D. Newman and T. Rivlin (1976) Approximation of monomials by lower degree polynomials. aequationes mathematicae 14 (3), pp. 451–455. Cited by: Table 3.
  • [35] R. Paturi (1992) On the degree of polynomials that approximate symmetric boolean functions (preliminary version). In Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pp. 468–474. Cited by: Table 3.
  • [36] N. Razavian, S. Blecker, A. M. Schmidt, A. Smith-McLallen, S. Nigam, and D. Sontag (2015) Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data 3 (4), pp. 277–287. Cited by: §9.
  • [37] C. Rudin, C. Wang, and B. Coker (2019) The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review. Note: (accepted) Cited by: §9.
  • [38] R. E. Schapire, Y. Freund, P. Bartlett, W. S. Lee, et al. (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. The annals of statistics 26 (5), pp. 1651–1686. Cited by: §2, §6.
  • [39] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony (1998) Structural risk minimization over data-dependent hierarchies. IEEE transactions on Information Theory 44 (5), pp. 1926–1940. Cited by: §2, Conclusion and Future Work.
  • [40] N. Srebro, K. Sridharan, and A. Tewari (2010) Smoothness, low noise and fast rates. In Advances in neural information processing systems, pp. 2199–2207. Cited by: §2, §2.
  • [41] N. Tollenaar and P.G.M. van der Heijden (2013) Which method predicts recidivism best?: a comparison of statistical, machine learning and data mining predictive models. Journal of the Royal Statistical Society: Series A (Statistics in Society) 176 (2), pp. 565–584. Cited by: §9.
  • [42] T. Tulabandhula and C. Rudin (2013) Machine learning with operational costs. Journal of Machine Learning Research 14, pp. 1989–2028. External Links: Link Cited by: §2.
  • [43] T. Tulabandhula and C. Rudin (2014) On combining machine learning with decision making. Machine Learning (ECML-PKDD journal track) 97 (1-2), pp. 33–64. Cited by: §2.
  • [44] V. N. Vapnik (1995) The nature of statistical learning theory. Springer. Cited by: §1, §1.
  • [45] V. Vapnik and A. Y. Chervonenkis (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (2), pp. 264. Cited by: §1, §2, §6, §6.
  • [46] C. S. Wallace and D. M. Boulton (1968) An information measure for classification. The Computer Journal 11 (2), pp. 185–194. Cited by: §2.
  • [47] H. Yang, D. E. Singer, A. S. Go, K. Reynolds, and C. Rudin (2019) A machine learning approach to predicting stroke in atrial fibrillation. Submitted. Cited by: §9.
  • [48] J. Zeng, B. Ustun, and C. Rudin (2017) Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society) 180 (3), pp. 689–722. Cited by: §9.
  • [49] D. Zhou (2002) The covering number in learning theory. Journal of Complexity 18 (3), pp. 739–767. Cited by: §2.
  • [50] J. Zhu, S. Rosset, R. Tibshirani, and T. J. Hastie (2004) 1-norm support vector machines. In Advances in neural information processing systems, pp. 49–56. Cited by: §E.3.

Appendix A List of notations

Please refer to the Table 4 for the list of notations and symbols used in the paper.

Notation Description
n number of points in a dataset
p number of features
S training dataset
𝒟 unknown data distribution
𝒵=𝒳×𝒴 data space; 𝒳 – input space; 𝒴 – output space
hypothesis space
ρ() prior distribution over
d(,) metric, defined on a metric space
p p-norm defined for elements of
ϕ(f(x),y) loss function
l(f,z) loss function
ζ(f,z) perfomance function
L true risk
f* optimal model
L^ empirical risk
f^ empirical risk minimizer
θ Rashomon parameter
R^set(,θ) Rashomon set
𝒱(R^set(,θ)) Rashomon volume
R^ratio(,θ) Rashomon ratio
R^ratiopat(,θ) pattern Rashomon ratio
γ anchored Rashomon parameter
Rsetanc(,γ) true anchored Rashomon set
R^setanc(,γ) anchored Rashomon set
R^ratioanc(,γ) anchored Rashomon ratio
Rratioanc(,γ) true anchored Rashomon ratio
K Lipschitz constant
Bδ(f) δ-ball centered at f
(,δ) packing number
𝒩(,δ) covering number
Rn Rademacher complexity
R^nS empirical Rademacher complexity computed on a dataset S
Ω parameter space
Ω parameterized hypothesis space
ω^ parameter in the parameter space that minimized empirical risk
C regularization parameter
σi singular values of the feature matrix
DT-D fully grown decision tree of depth D
𝒜 a learning algorithm
Table 4: List of symbols and notation used in this paper

Appendix B Rashomon volume and ϵ-flatness

Please see Figure 15 to understand the difference between the Rashomon volume and ϵ-flatness.

(a) The ϵ-flatness
(b) The Rashomon set
Figure 15: Difference between ϵ-flatness as defined in [17] and the Rashomon volume. Red lines represent (a) ϵ-flatness, (b) the Rashomon volume. The height of the shaded area represents (a) the parameter ϵ or the 2σ-sharpness, (b) the Rashomon parameter θ. The ϵ-flatness is defined by a connected component in a parameter space for a given local minimum, while the Rashomon set is defined with respect to an empirical risk minimizer over the full hypothesis space and may contain models from multiple local minima. Rashomon sets are also defined for discrete spaces.

Appendix C Proof of Statement 3.1

Proof.

Assume the model class becomes arbitrarily flexible, then at some value of D, each possible labeling of points (each pattern) will constitute a separate equivalence class. Then, the total number of all possible patterns, given that we have two classes, will be 2n. Also, since each possible pattern is realized, there will be one pattern that achieves the best possible accuracy, 100%. Given the Rashomon parameter θ, a classification pattern should produce an accuracy of at least 1-θ in order for its equivalence class of functions to be in the Rashomon set. Therefore, the Rashomon set can tolerate at most θn points to be misclassified, which leads to the pattern Rashomon ratio limit R^ratiopat(,θ)i=0θn(ni)2n.

We obtain the upper bound for R¯pat based on iθn(ni)2H(θ)n for any fixed θ1/2 [19]. The lower bound for R¯pat follows from simple observations i=0θn(ni)(nθn) and (nθn)2nH(θ)8nθ(1-θ) [32].

Appendix D Proofs for generalization results

D.1 Proof of Theorem 4.1

From the definition of the true anchored Rashomon set, it follows that any model in it is γ-close to the ERM. Simple observation shows that any model in the true anchored Rashomon set is also γ-close to any other model in it and is summarized is the lemma below.

Lemma D.0.1.

For any models f,fF that are in the true anchored Rashomon set Rsetanc(F,γ) we have |L(f)-L(f)|γ.

Proof.

Consider two models f and f from the true anchored Rashomon set Rsetanc(,γ). Let L(f)=γ and L(f)=γ′′. Then if γ>γ′′: L(f)-L(f)=γ-γ′′γγ, otherwise L(f)-L(f)=γ′′-γγ′′γ. Combining these inequalities, we get the statement of the lemma. ∎

Now we provide the proof of Theorem 4.1

Proof.

We apply the union bound and Hoeffding’s inequality. The result is that with probability at least 1-ϵ for every f11 we have, for a finite hypothesis space 1:

|L(f1)-L^(f1)|2blog|1|+log2/ϵ2n. (7)

Combining this Occham’s razor bound with the definition of f2*argminf2L(f) we get that, under the same conditions:

L(f2*)L(f^1)L^(f^1)+2blog|1|+log2/ϵ2n.

By assumption of the theorem, there exists a function f~11 such that f~1Rsetanc(2,γ). Since f2* is an optimal model, then f2*Rsetanc(2,γ) as well. From Lemma D.0.1, |L(f2*)-L(f~1)|γ, which implies L(f~1)L(f2*)+γ. Given that f1^argminf1L^(f), and using (7), we get that with probability at least 1-ϵ, we have:

L^(f^1)L^(f~1)L(f~1)+2blog|1|+log2/ϵ2nL(f2*)+γ+2blog|1|+log2/ϵ2n.

Combining the previous two equations together we have:

|L(f2*)-L^(f^1)|γ+2blog|1|+log2/ϵ2n.

D.2 Proof of Theorem 4.2

Proof.

The true anchored Rashomon set Rsetanc(2,γ) has Rratioanc(1,γ)|2| models. The probability that at least one of these models is from the hypothesis space 1 is: p=1-((1-Rratioanc(2,γ))|2||1|)(|2||1|). In the fraction, the numerator is the number of ways we could randomly select |1| models that are outside of the Rashomon set, whereas the denominator is the total number of ways we can select |1| models from |2| at random.

Now with probability p=1-((1-Rratioanc(2,γ))|2||1|)(|2||1|) we can guarantee that the hypothesis space 1 will contain at least one model from the true anchored Rashomon set, therefore by using Theorem 4.1 we get the statement of Theorem 4.2.

Simplifying the the binomial coefficients we get that:

1-p=((1-Rratioanc(2,γ))|2||1|)(|2||1|)=((1-Rratioanc(2,γ))|2|)!|1|!(|2|-|1|)!|1|!((1-Rratioanc(2,γ))|2|-|1|)!|2|!=((1-Rratioanc(2,γ))|2|)!|2|!(|2|-|1|)!((1-Rratioanc(2,γ))|2|-|1|)!=i=1Rratioanc(2,γ)|2|(1-Rratioanc(2,γ))|2|-|1|+i(1-Rratioanc(2,γ))|2|+i=i=1Rratioanc(2,γ)|2|(1-|1|(1-Rratioanc(2,γ))|2|+i)=i=1|Rsetanc(2,γ)|(1-|1||2|-|Rsetanc(2,γ)|+i).

Therefore, alternatively p=1-i=1|Rsetanc(2,γ)|(1-|1||2|-|Rsetanc(2,γ)|+i), which is an easier expression to compute in practice, especially for large values of |2|.

D.3 Proof of Theorem 4.3

Theorem 4.3 follows directly from Lemma 4.2.1 and Theorem 4.1, which guarantees that with high probability the sampled space 1 will contain at least one model from the true anchored Rashomon set. Now we will provide a proof of Lemma 4.2.1.

Proof.

The probability of an individual sample from 2 missing the true anchored Rashomon set is 1-Rratioanc(2,γ). The probability if this happening |1| times independently is (1-Rratioanc(2,γ))|1|. Thus, for any ϵ>0, if the Rashomon ratio is at least Rratioanc(2,γ)1-ϵ1|1|, the probability pw of sampling, with replacement, at least one hypothesis from Rratioanc(2,γ) is:

pw=1-(1-Rratioanc(2,γ))|1|1-ϵ.

Let pi be the probability, under sampling without replacement, that samples 1i have missed Rratioanc(2,γ). p1=1-Rratioanc(2,γ), and pi(1-Rratioanc(2,γ))i. The probability, under sampling without replacement, that at least one hypothesis from Rratioanc(2,γ) in 1 is therefore 1-p|1|pw. Thus the the statement of the lemma holds with the probability at least 1-ϵ. ∎

D.4 Proof of Statement 4.1

Proof.

For a fixed fR^setanc(,γ), by Hoeffding’s inequality:

P[L^(f)-L(f)<-ϵ]=P[1ni=1nl(f,zi)-𝔼[l(f,z)]<-ϵ]e-2n(ϵ/b)2.

Therefore, with probability at least 1-e-2n(ϵ/b)2 with respect to the random draw of data, L(f)-L^(f)ϵ.

Since fR^setanc(,γ), then by definition of the Rashomon set, L^(f)γ. Combining this with Hoeffding’s inequality, with probability at least 1-e-2n(ϵ/b)2:

L(f)L^(f)+ϵγ+ϵ,

therefore fRsetanc(,γ+ϵ).

D.5 Proof of Statement 4.2

Proof.

For a fixed fRsetanc(,γ) by Hoeffding’s inequality:

P[L^(f)-L(f)>ϵ]=P[1ni=1nl(f,zi)-𝔼[l(f,z)]>ϵ]e-2n(ϵ/b)2.

Therefore, with probability at least 1-e-2n(ϵ/b)2 with respect to the random draw of data, L^(f)-L(f)ϵ.

Since fRsetanc(,γ), then by definition of the Rashomon set, L(f)γ. Combining this with Hoeffding’s inequality, we get that with probability at least 1-e-2n(ϵ/b)2:

L^(f)L(f)+ϵγ+ϵ,

therefore fR^setanc(,γ+ϵ).

D.6 Proof of Theorem 4.4

Proof.

The first result follows directly from the definition of the Rashomon set and Lipschitz continuity:

L^(f¯1)-L^(f^2)=|L^(f¯1)-L^(f^2)|Kf¯1-f^2p=KθK=θ.

Using Bartlett and Mendelson’s generalization bound for Lipschitz loss functions [4] we have that for every model f11, with probability greater that 1-ϵ, |L(f1)-L^(f1)|2KRn(1)+blog(2/ϵ)2n. Since f¯11, the bound holds for it as well. ∎

D.7 Proof of Theorem 4.5

Proof.

Consider a ball Bδ(f2) of radius δ centered at f2 that is contained within the Rashomon set. By the theorem’s assumption, since f2R^set(2,θ), there exists f¯11 such that f2-f¯1pδ. Therefore f¯1 is inside the δ-ball f¯1Bδ(f2) and thus belongs to the Rashomon set R^set(2,θ).

The generalization bound follows Bartlett and Mendelson [4] as before. ∎

D.8 Proof of Theorem 4.6

Proof.

Starting from the packing number of the Rashomon set (R^set(2,θ),2δ), there exists a 2δ-packing Ξ={ξ1,,ξk|ξiR^set(2,θ)} such that ξi-ξjp>2δ. On the other hand, for each ξiR^set(2,θ) there exists f¯1i1 such that ξi-f¯1ipδ. Therefore for each ball center ξi in the packing there is a distinct model f¯1i from the simpler hypothesis space 1. Thus, the Rashomon set contains at least B=(R^set(2,θ),2δ) models from 1.

The generalization bound follows Bartlett and Mendelson [4] as before. ∎

D.9 Proof of Theorem 4.7

Proof.

Follows from the triangle inequality, the theorem’s statement, and Bartlett and Mendelson’s generalization bound for K-Lipschitz loss functions [4]:

|L(f2)-L^(f2)||L(f2)-L(f1)|+|L^(f2)-L^(f1)|+|L(f1)-L^(f1)|Kδ+Kδ+2KRn(1)+blog(2/ϵ)2n

D.10 Proof of Theorem 4.8

Proof.

For a given Rashomon parameter θ consider the smallest δ-ball centered at the empirical risk minimizer f^ that contains the Rashomon set, then its radius is δθ=maxfR^set(,θ)f-f^p.

Separately, let H be a δθ-cover of with minimal cardinality:

hHBδθ(h),

or alternatively:

 f  hH such that f-h<δθ.

Let 𝒩(,δθ) be the number of elements in the cover. Since it is determined by a minimum cardinality cover of , it is therefore a δθ-covering number.

By using the union bound over the cover and Hoeffding’s inequality we get that

PS𝒟n[hl{h1,..,h𝒩(,δθ)}:|L(hl)-L^(hl)|θ]l=1𝒩(,δθ)PS𝒟n(|L(hl)-L^(hl)|θ)l=1𝒩(,δθ)2e-2n(θ/b)2=2𝒩(,δθ)e-2n(θ/b)2. (8)

For empirical risk minimizer f^, given a cover H there exists a ball center h^ such that it is nearby: f^-h^pδθ. By the Rashomon assumption in the statement of the theorem, a ball of size δθ contains the Rashomon set R^set(,θ). Therefore for any model f from the Rashomon set, f-f^pδθ. Combining this with the h^ closest to f^ we get:

f-h^p=f-f^-h^+f^pf-f^p+f^-h^p2δθ.

Since the loss is K-Lipschitz, then both the true and the empirical risk will be Lipschitz with the same constant due to linearity of expectation. Therefore we have:

|L^(f)-L^(h^)|Kf-h^p2Kδθ, (9)
|L(f)-L(h^)|Kf-h^p2Kδθ. (10)

If all of the hl’s generalize, that is, |L(hl)-L^(hl)|θl, then h^ must generalize, |L(h^)-L^(h^)|θ, in which case we can combine with (9) and (10) to see that

|L(f)-L^(f)| = |L(f)-L^(f)+L(h^)-L(h^)+L^(h^)-L^(h^)|
|L(f)-L(h^)|+|L(h^)-L^(h^)|+|L^(f)-L^(h^)|
4Kδθ+θ.

Thus, the probability of |L(f)-L^(f)|4Kδθ+θ must be at least the probability of |L(hl)-L^(hl)|θl. Conversely, the event that there is an l such that |L(hl)-L^(hl)|>θ must have a greater probability than the event that |L(f)-L^(f)|>4Kδθ+θ. This statement, combined with (8) yields:

PS𝒟n[|L(f)-L^(f)|>4Kδθ+θ]
PS𝒟n[hl{h1,..,h𝒩(,δθ)}:|L(hl)-L^(hl)|θ]
2𝒩(,δθ)e-2n(θ/b)2,

that gives us the statement of the theorem.

Appendix E Approximation of the Rashomon ratio and volume

E.1 Rashomon volume for ridge regression

E.1.1 Visualization of the Rashomon set for ridge regression

Consider least squares regression, which is a corner case of ridge regression with the regularization constant C=0. Figure 16 shows the plot of the empirical risk in two dimensional parameter space. Visually, the Rashomon set consists of the those parameters ω=[ω1,ω2] that produces a loss below the dark shaded ellipse on the paraboloid, and then Rashomon volume can be computed exactly as the volume of the shaded ellipsoid.

Figure 16: The Rashomon set for the two-dimensional least squares regression. The volume of a shaded ellipsoid corresponds to the Rashomon volume in a parameter space.

E.1.2 Proof of Theorem 5.1

Proof.

Consider all models fωΩ from the Rashomon set R^set(Ω,θ). Then by Definition 3.1 we get:

L^(X,Y,ω)L^(X,Y,ω^)+θ. (11)

Using XTY=(XTX+CIp)ω^ from the optimal solution of the ridge regression estimator ω^=(XTX+CIp)-1XTY, and expanding the difference between empirical risks we have:

θL^(X,Y,ω)-L^(X,Y,ω^)=(Xω-Y)T(Xω-Y)+CωTω-(Xω^-Y)T(Xω^-Y)-Cω*Tω^=ωTXTXω-2ωTXTY+CωTω-ω*TXTXω^+2ω*TXTY-Cω*Tω^=ωTXTXω-2ωT(XTX+CIp)ω^+CωTω-ω*TXTXω^+2ω*T(XTX+CIp)ω^-Cω*Tω^=ωTXTXω+CωTω-2ωT(XTX+CIp)ω^+ω*TXTXω^+Cω*Tω^=ωT(XTX+CIp)ω-2ωT(XTX+CIp)ω^+ω*T(XTX+CIp)ω^=(ω-ω^)T(XTX+CIp)(ω-ω^).

Therefore the Rashomon set is an ellipsoid centered at ω^:

(ω-ω^)TXTX+CIpθ(ω-ω^)1.

By the formula of the volume of a p-dimensional ellipsoid the Rashomon volume can be computed as:

𝒱(R^set(Ω,θ))=πp/2θp/2Γ(p/2+1)i=1p1σi2+C,

where σi are singular values of X. ∎

E.1.3 Rashomon volume lower bounds for ridge regression

The results described in this appendix follow directly from the Theorem 5.1 and are basic observations of the Rashomon volume closed-form formula for the ridge regression.

Corollary E.0.1.

For a dataset S=X×Y such that a Frobenius norm of the feature matrix X is bounded XF=i,jp,nxij2[1,F] and for a parametric hypothesis space of linear models FΩ={fω(x)=ωTx,ωp}, the Rashomon volume of ridge regression is at least

𝒱(R^set(,θ))2K(θ,p)F+pC.
Proof.

For real ai0, i=1,..,p we have that (i=1pai)1/q(i=1pai)/q, then by by setting q=2 and ai=σi2+C we get: (i=1p(σi2+C))1212(i=1p(σi2+C))=12i=1p(σi2+pC)=12(XF+pC)12(F+pC), therefore from the Theorem 5.1 we have that 𝒱(R^set(,θ))2K(θ,p)F+pC.

Corollary E.0.2.

For a dataset S=X×Y and a parametric hypothesis space of linear models FΩ={fω(x)=ωTx,ωp}, if 2L^ωj2δ, such that δ2C, then the Rashomon volume of ridge regression is at least

𝒱(R^set(,θ))2K(θ,p)p(δ2-C)+pC.
Proof.

As in previous Corollary we can bound the singular values product with the Frobenius norm of the feature matrix (i=1p(σi2+c))1212(XF+pc). Given the bounded second derivative we have 2L^ωj2=2ijxij2+2Cδ. By the assumption δ2C we get that ijxij2δ2-C and therefore we can upper bound the Frobenius norm as follows: XF=j(ixij2)j(δ2-C)=p(δ2-C). Taking into account the Theorem 5.1 𝒱(R^set(,θ))2K(θ,p)p(δ2-C)+pC.

Corollary E.0.3.

For a dataset S=X×Y, such that xi are on a unit sphere i:xi=1 and a parametric hypothesis space of linear models FΩ={fω(x)=ωTx,ωp}, the Rashomon volume of ridge regression is at least

𝒱(R^set(,θ))2K(θ,p)n+pC
Proof.

As in previous Corollaries we can bound the singular values product with the Frobenius norm of the feature matrix (i=1p(σi2+c))1212(XF+pc). Since XF=i(jxij2)=i1=n, then i=1nσi2+cn+pc2, and combined with the Theorem 5.1 we get that 𝒱(R^set(,θ))2K(θ,p)n+pC.

E.2 Convex loss

For a parametric hypothesis space where the loss l(ω,z) is convex with respect to the parameter vector ω, the Rashomon set as well as the hypothesis space, are convex as well, and we can use random walks to estimate their volumes. In particular, according to Theorem 2.1 by Kannan et al.[23], there exists a randomized algorithm that can approximate, with high probability, the volume of a convex body Vp within an ϵ error using approximately O(p5) calls to a separating oracle. In particular we can approximate the volume 𝒱^(V) such that:

(1-ϵ)𝒱^(V)<𝒱(V)<(1+ϵ)𝒱^(V). (12)

To adapt the randomized algorithm theorem for Rashomon set estimation in the parameter space Ω, we need to construct a separating oracle [20]: a routine that, for a given point λ and convex set Λ, tells us whether λΛ, and if not, provides a separating hyperplane between λ and Λ. From the Rashomon set definition, given a parameter vector ω, we check if fω belongs to the Rashomon set according to L^(fω)L^(fω^)+θ. If fω is not in the Rashomon set, we construct a separating hyperplane in parameter space Ω using the perpendicular to the tangent hyperplane. In particular, since the loss is convex, a tangent hyperplane at point ω looks like:

l(ω,)(γ-ω)=0.

Let ωpr=PRR^set(Ω,θ)(ω) be a projection of point ω onto the Rashomon set. Then, we derive a separating hyperplane to be in the middle of ω and its projection:

l(ω,)(γ-ω)+(ω-ωpr)/2=0.

Applying the constructed separating oracle to the randomized algorithm theorem, with high probability, we can achieve approximation guarantees for the Rashomon volume given in (12).

E.3 Rashomon volume under-approximation for SVM-1

In this section we propose an optimization procedure that allows to under-approximate the Rashomon volume in the parameter space for the Support Vector Machine (SVM)[7] with L-1 regularization.

Consider binary classification for the class of linear models. Directly computing the Rashomon ratio requires sampling over an infinite space of linear functions. Instead we propose a procedure that allows us to compute L-1 ball that is contained within the Rashomon set for SVM-1 [8, 50] learning algorithm.

Theorem E.1.

For the SVM-1 with hinge loss ϕ(f(x),y)=1-yf(x)+ and L-1 regularization norm and for parameterized hypothesis space of linear models F={ωTx, ωp} the Rashomon volume is at least

𝒱(R^set(,θ))2pδpp!,

where δ=ωv-ωc2 and ωv satisfies minωvR^set(Fω,θ)ωv-ωc1.

Proof.

Assume that ωc is an optimal solution to the 1-norm SVM problem:

minωcωc1+i=1n1-yifωc(xi)+.

By solving following optimization problem

minωvωv-ωc1, s.t. i=1n1-yifωv(xi)+θ,

we get ωvR^set(ω,θ) that is the closest to ωc in the parameter space. The volume of a cross-polytope with a half-diagonal d is given by 2pdpp!. Since the Rashomon volume at least contains the L-1 ball with a half-diagonal δ=ωv-ωc2 we get the statement of the theorem.

Appendix F Proofs for connection to statistical learning theory

F.1 Proof of Theorem 6.1

Proof.

Consider the least squares regression minωi=1nl(ω,𝐳i)2, where ωp, and loss l(ω,𝐳)=ϕ(ωT𝐱,𝐲) for 𝐳=(𝐱,𝐲). For the marginal distribution PX and 𝐗=[𝐱𝟏,,𝐱𝐧] drawn i.i.d. from PX we design distributions PY1|𝐗 and PY2|𝐗 as:

PY1|𝐗(y=𝟎|𝐱)=1𝐱𝐗,
PY2|𝐗(y=𝟎|𝐱𝐱𝟎)=1, PY2|𝐗(y=𝟎|𝐱=𝐱𝟎)=0.5, PY2|𝐗(y=𝐇|𝐱=𝐱𝟎)=0.5,

where 𝐱𝟎{x1,,xN} is some fixed point with a positive probability PX(𝐱𝟎) and we define 𝐇 later.

According to the definition of algorithmic stability, for PX,Y1 we have:

𝔼S1,z[|l(f𝐒1,𝐳)-l(f𝐒1i,𝐳)|]=0=β1~,

and for distribution PX,Y2:

𝔼S2,z[|l(f𝐒2,𝐳)-l(f𝐒2i,𝐳)|]=𝐒2,𝐳PX,Y2PX,Y2(𝐒2)PX,Y2(𝐳)|l(f𝐒2,𝐳)-l(f𝐒2i,𝐳)|PX,Y2(𝐒2s)PX,Y2(𝐳s)|l(f𝐒2s,𝐳s)-l(f𝐒2s,i,𝐳s)|,

where 𝐒2s,𝐳s is a special draw such that 𝐳s=(𝐱𝟎,𝐇) and 𝐒2s contains both (𝐱𝟎,𝐇) and (𝐱𝟎,𝟎). Since the domain 𝒳 is discrete, the probabilities of a special draw are:

PX,Y2(𝐳s)=12Bin(1,n,PX(𝐱𝟎)),PX,Y2(𝐒2s)=14Bin(1,n,PX(𝐱𝟎))2Bin(n-2,n,1-PX(𝐱𝟎)),

where Bin(k,n,pk)=(nk)pkk(1-pk)(n-k) is a binomial coefficient, namely a probability of getting exactly k successes from n trials, where each trial has a probability of success pk. Denote P(𝐒2s,𝐳s) as the probability of getting a special draw, then P(𝐒2s,𝐳s)=PX,Y2(𝐒2s)PX,Y2(𝐳s).

If 𝐒2s contains only two points (𝐱𝟎,𝐇) and (𝐱𝟎,𝟎), the loss difference |l(f𝐒2s,𝐳s)-l(f𝐒2s,i,𝐳s)| evaluated at 𝐳s for all i will be at least 𝐇24. As we add more points (𝐱i,𝟎) to the dataset 𝐒2s the loss difference in the special draw case will only increase. Therefore for all i:

|l(f𝐒2s,𝐳s)-l(f𝐒2s,i,𝐳s)|𝐇24.

If we choose 𝐇 such that 𝐇>2λ(P(𝐒2s,𝐳s))-1/2, then from the definition of algorithmic stability we have:

β2~𝔼S2,z[|l(f𝐒2,𝐳)-l(f𝐒2i,𝐳)|]P(𝐒2s,𝐳s)𝐇24>λ.

Therefore for any given λ we get that |β1~-β2~|>λ.

On the other hand, the Rashomon volume for the hypothesis space Ω of linear models does not depend on targets and can be calculated as in (5) for both 𝐒1 and 𝐒2. Therefore the expected Rashomon volumes are the same:

𝔼S1[𝒱(Rset𝐒1(Ω,θ))]=𝔼S2[𝒱(Rset𝐒2(Ω,θ))].

Given the equality of the expected Rashomon volumes the the expected Rashomon ratios are the same as well as we do not change the hypothesis space when drawing different joint distribution.

F.2 Proof of Theorem 6.2

Proof.

Consider two-dimensional separable data, 𝒳[0,1]2, and a parametrized hypothesis class of origin-centered linear models: ={ωTx,ω=(k,-1),x2,k}. Consider also 0-1 loss ϕω(x,y)=𝟙[y=sign(ωTx)] and an empirical risk minimizer f^=fω^ that maximizes the geometric margin. Since the data are populated in a [0,1]2 hypercube, as a hypothesis space we will consider all models that intersect the unit-hypercube.

For some positive constant a(0,1) that we choose later, consider following regions of the feature space:

A={x1[0,1-a),x2>x1+(1-2a)},B={x1(a,1],x2<x1-(1-2a)},
C={x1[0,a),x2(1-a,1]},D={x1(1-a,1],x2[0,a)}.

Construct dataset S1, such that S1=(xA,1)(xB,-1)(xS1s1,1)(xS1s2,-1), where xAA is any sample from the region A, xBB is any sample from the region B, xS1s1 and sS1s2 are special points for the dataset S1 such that xS1s1=[1-2a,1] and xS1s2=[1,1-2a]. Please see Figure 16(a) for details.

Construct dataset S2, such that S2=(xC,1)(xD,-1)(xS2s1,1)(xS2s2,-1), where xCC is any sample from the region C, xDD is any sample from the region D, xS2s1 and xS2s2 are special points for the dataset S2 such that xS2s1=[a,1-a] and xS2s2=[1-a,a]. Please see Figure 16(b) for details.

Note, that datasets we considered have the same width of the geometrical margin d=2(2a-1) (see Figures 16(a), 16(b)). Now, we are left to show that the Rashomon ratios are different.

For the functional space of origin-centered lines we have a unique parametrization and a one-to-one correspondence between an actual model and its parametrization. Therefore, if the Rashomon set is a single connected component, an angle α between the two most distant models in the Rashomon set gives us some information about the Rashomon volume. In particular, we can compute the Rashomon ratio as a ratio of the angle α that represents the Rashomon set and the angle β that corresponds to the hypothesis space as shown on Figure 16(c). Since the hypothesis space is defined on the unit-hypercube, β=π/2 and for the Rashomon parameter θ=0 the Rashomon ratio is:

R^ratio(,0))=αβ=2maxfR^set(Ω,0)|arctan(fω^)-arctan(fω)|π/2.

For datasets S1 and S2 Figures 16(d) and 16(e) show the Rashomon set and angles α1 and α2 that represent the Rashomon volume. Given the special points in the datasets we can compute α1 and α2 exactly: α1=2(arctan(1)-arctan(1-2a))=π2-2arctan(1-2a) and α2=2(arctan(1)-arctan(a1-a))=π2-2arctan(a1-a). Then the Rashomon ratios difference is:

|RratioS1(,0)-RratioS2(,0)|=|α1-α2π/2|=|4π(arctan(1-2a)-arctan(a1-a))|=|4πarctan(1-4a-22a2-1)|.

Now if we choose a(0,1) and such that |4πarctan(1-4a-22a2-1)|>λ, then the Rashomon ratio difference |RratioS1(,0)-RratioS2(,0)| is at least λ.

(a) Structure of S1
(b) Structure of S2
(c) R^ratio(Ω,θ)=αβ
(d) R^ratioS1(Ω,0)=α1π/2
(e) R^ratioS2(Ω,0)=α2π/2
Figure 17: An illustration of different Rashomon ratios with identical geometric margins. The black line shows the optimal model, the shaded region indicates the Rashomon set R^set(Ω,0) with its boundaries represented by green lines, the dark color indicates boundaries of the hypothesis space. (a) and (b) show the datasets S1 and S2 with identical margin d. (c) shows that the Rashomon ratio can be computed as a ratio of angles α (represents the Rashomon set) and β (represents the hypothesis space). (d) and (e) illustrate that datasets S1 and S2 are represented by different angles α1 and α2 and therefore have different Rashomon ratios. Best seen in color.

F.3 Proof of Theorem 6.3

Proof.

Consider two-dimensional separable symmetric data, 𝒳[0,1]2, 𝒴={0,1}, 0-1 loss ϕf(x,y)=𝟙[y=signf(x)] with empirical risk minimizer f^, and a hypothesis class Ω of decision stumps based on the first feature, where for fΩ: f=1 if x1>ω, ω, f=0 otherwise. We have a one-to-one correspondence between a function and its threshold parameter ω. Therefore, if the Rashomon set is a single connected component, we can compute the Rashomon volume in a parameter space by computing the difference between the largest and smallest threshold values of models within the Rashomon set, as illustrated in Figure 17(a). The difference between the largest and the smallest threshold values will be equivalent to the minimal distance between points of opposite classes projected onto the first feature d=minxi,xj:yiyj|PR1(xi)-PR1(xj)|.

For the hypothesis space we consider all models that intersect a unit-hypercube [0,1]2, where data are populated. The difference in thresholds for the hypothesis space is β=1 and therefore 𝒱((Ω)=1. For θ=0 the Rashomon volume will be equivalent to d – the projected minimal distance between points of opposite classes, and have that 𝒱(R^set(Ω,0))=d and R^ratio(R^set(Ω,0))=d1=d. Now consider any two separable symmetric datasets S1, S2 with different projected minimal distances d1 and d2, such that |d1-d2|>λ (Please see Figure 17(c) and 17(d) for details for the datasets S1 and S2). Consequently we get that

|RratioS1(Ω,0)-RratioS2(Ω,0)|=|d1-d2|>λ.

For a separable symmetric data S and 0-1 loss function, the Rashomon set R^set(Ω,0) contains all models that separate data in the same way. Therefore:

R^nS(R^set(Ω,0))=1n𝔼σ[supfR^set(Ω,0)i=1nσif(xi)]=1n𝔼σ[i=1nσif^(xi)]=0.

Equality of the empirical Rademacher complexity of the optimal model to zero follows from the symmetric data considered and symmetrical patterns of all possible target assignments. For example, for a toy dataset in Figure 17(b): R^nS(R^set(Ω,0))=1214((f(x1)+f(x2))+(f(x1)-f(x2))+(-f(x1)+f(x2))+(-f(x1)-f(x2)))=0. Since both S1 and S2 are separable and symmetric we get that:

R^nS1(R^set(Ω,0))=0=R^nS2(R^set(Ω,0)).

(a) R^ratio(Ω,θ)=dγ
(b) Toy dataset
(c) R^ratioS1(Ω,0)=d1
(d) R^ratioS2(Ω,0)=d2
Figure 18: An illustration of different Rashomon ratios with equivalent empirical local Rademacher complexities. Black line shows the optimal model, shaded region indicates the Rashomon set R^set(Ω,0) with its models represented by green lines, the magenta color indicates boundaries of the hypothesis space. (a) shows that the projected minimal distance d is equivalent to the Rashomon volume. (b) is a toy dataset that illustrates that the empirical local Rademacher complexity is zero for models in the Rashomon set. (c) and (d) illustrate symmetric separable datasets with different Rashomon ratios. Best seen in color.

Appendix G Dataset descriptions

We provide a description of the datasets used in our experiments in Table 5. All of them we downloaded from the UCI machine learning repository. We show a number of features and classes in each dataset, sizes of the training and the test data (for each dataset we split the available data to save 80% for training purposes and 20% for testing) and any preprocessing steps that we used. All the real-valued dataset we normalized to fit the unit-cube and we did not standardize the data. During data processing, we omit data records with missing values and not real-valued features (e.g. date, text) if we do not convert them to categorical.

We also performed experiments on twelve synthetic binary classification datasets. These datasets have two real features and represent different geometrical concepts for two-dimensional classification (e.g. large and small margins, concentrated circles, half moons, etc.). Results and implications for synthetic datasets are consistent with ones that we provided for the UCI datasets.

Dataset Name Features type #features #classes Train set size Test set size Processing notes
Monks-1 Binary 15 2 444 112
Monks-2 Binary 15 2 480 121
Monks-3 Binary 15 2 443 111
Voting Binary 16 2 185 47
SPECT Binary 22 2 213 54
Tic-tac-toe Binary 27 2 766 192
Hayes-Roth Binary 12 2 103 26 Classes 1 and 2 considered only
Nursery-1 Binary 27 2 6868 1718 Class not_recom and priority considered only
Nursery-2 Binary 27 2 6648 1662 Class priority and spec_prior considered only
Mushroom Binary 117 2 6499 1625
Breast Cancer Binary 43 2 228 58
Car Evaluation Binary 21 4 1382 346
Primary Tumor Binary 31 4 142 36 Top four representative classes selected (1, 5, 11, 18)
Mammographic masses Binary 25 2 664 166
Phishing data Binary 23 3 1082 271
Wine Real 13 3 142 36
Iris Real 4 3 120 30
Breast Cancer Wisconsin Real 30 2 455 114
Breast Cancer Coimbra Real 9 2 92 24
Digits 0-4 Real 64 2 290 73 Classes 0 and 4 considered only
Digits 6-8 Real 64 2 284 71 Classes 6 and 8 considered only
Student Real 3 2 320 80
Banknote Real 4 2 1097 275
Mapping Real 28 6 8436 2109
Wifi localization Real 7 4 1600 400
Column 2C Real 6 2 248 62
Credit Card Real 23 2 24000 6000
Planing Relax Data Real 12 2 145 37
Diabetic Retinopathy Real 19 2 920 231
Survival Real 3 2 244 62
Skin segmentation Real 3 2 196045 49012
HTRU_2 Real 8 2 14318 3580
Magic data Real 10 2 15216 3804
Seeds Real 7 3 168 42
MNIST 0-1 Real 784 2 11623 2115 Classes 0 and 1 considered only
MNIST 4-9 Real 784 2 10761 1991 Classes 4 and 9 considered only
Table 5: Classification datasets description and processing notes.

Appendix H Quality of the features

In our experiments, we also observed a connection between the quality of the features and Rashomon ratios. The Rashomon ratio, as defined in (1) in its simplest form, is a volume fraction of models that are inside the Rashomon set compared to the models in the hypothesis space. When a dataset is augmented with additional features, the size of the hypothesis space grows. If the added features are completely irrelevant (consisting, for instance, of noise) then adding these features increases the size of the hypothesis space but does not increase the size of the Rashomon set. Thus, we might predict that the Rashomon ratio could decrease as irrelevant features are added to a dataset.

Conversely, if we augment a dataset with features that are highly correlated or identical to features that improve performance, then not only is the size of the hypothesis space increased, but also the size of the Rashomon set is likely to increase, as there exist more relevant models (even if the set becomes redundant with models that predict equivalently). Thus, we might predict that the Rashomon ratio increases as we add copies of relevant features.

In general, these two examples of irrelevant and redundant features are corner causes, however, they do occur to a lesser degree in real world datasets, and we are interested in whether these cases have potentially influenced our experimental results in Section 7 in our observed Rashomon ratios. To investigate this, we augmented a dataset with noise features, and separately, augment the same dataset with copies of useful features to see whether irrelevant or correlated features may have influenced our findings on the measurement of the Rashomon ratio. We used the Wine dataset, which has approximately four important features. The results are shown in Figure 19. As before, our hypothesis space is decision trees of depth seven.

Irrelevant features. If the dataset contains a lot of irrelevant or noisy features we expect the Rashomon set to be relatively small compared to the hypothesis space. Figure 19(a) shows how the Rashomon ratio changes as we iteratively decrease the number of features in the Wine dataset, eliminating the least relevant features first, leaving the most significant ones (where relevance is determined according to a χ2 test with the label). The Rashomon ratio grows as we first remove non-significant features, and after reaching a peak at around four features, it starts to decrease as we remove relevant features, and as models lose accuracy. Similarly, Figure 19(b) shows the influence of noisy features on the Rashomon ratio. Particularly, as we add more noisy irrelevant features, the Rashomon ratio starts to decrease. This is due to the same fact, that we artificially enlarge the hypothesis space while keeping the Rashomon set approximately the same. The noise features do not help improve the empirical risk, they only increase the size of the reasonable set.

Redundant features. As a contrast to how we increased the hypothesis space in the previous experiment, we can increase the Rashomon set by adding more redundant, good features. Figure 19(c) shows how the Rashomon ratio changes for the Wine dataset as we add more copies of the four the most significant features. We observe that the Rashomon ratio increases. By adding copies of relevant features, we had increased the opportunity of randomly sampling good splits when we sample decision trees.

Our findings show a possible connection between the Rashomon ratio and feature analysis. In particular, in the case where different algorithms perform similarly, but the Rashomon ratio is observed to be small, it could be due to the reason that the dataset contains noisy or irrelevant features. In that case, it may be possible to iteratively remove features to find those that produce the largest Rashomon ratio without changes to the empirical risk. The other extreme is less likely to be observed in practice, which is when the Rashomon ratio is extremely large due to redundant features. In that case, one could remove redundant (highly correlated) features before measuring the Rashomon ratio. The datasets with smaller numbers of features induce easier learning/optimization problems in general. As we discussed earlier, the Rashomon ratio would generally not be measured in practice, and would be inferred in other ways. Thus, these results mainly pertain to an understanding of the experiments we did in Section 7 to provide a possible explanation for cases of small observed Rashomon ratios but where all methods perform the same and all functions generalize.

(a) Reducing features
(b) Adding noise
(c) Adding good features
Figure 19: An illustration of the influence of feature quality on the Rashomon ratio for the Wine dataset. (a) shows the Rashomon ratio for the dataset with different number of significant features according to a χ2 test. Denote the Wine dataset with four the most significant features as Wine4. (b) depicts the correspondence between the Rashomon ratio and different amounts of noisy features added to Wine4 dataset. The noise features are sampled from normal distribution 𝒩(0,1) and then standardized to be in a hypercube of volume=1. (c) shows the change in the Rashomon ratio as we add more redundant features to the Wine dataset. We iteratively add one out of four features from the Wine4 dataset at a time.

Appendix I Performance of different machine learning algorithms and Rashomon ratio

Figure 20 and Figure 21 show the comparison of the performance of different machine learning algorithms for all categorical and real-valued datasets respectively.

Figure 20: Performance of top five machine learning algorithms for the UCI classification datasets with categorical features.
Figure 21: Performance of top five machine learning algorithms for the UCI classification datasets with real-valued features.
Figure 22: Performance of top five machine learning algorithms for the synthetic datasets with real-valued features.

Appendix J Rashomon curve plots for all datasets

Figure 23 and Figure 24 show the Rashomon curves for all categorical and real-valued datasets respectively. The Rashomon elbow generalization error for the UCI-datasets we considered is shown in Figure 26 for categorical data and in Figure 27 for real-valued data. Finally, Figure 29 shows the Rashomon trend for a hierarchy of polynomial hypothesis spaces for ridge regression Table 6 describes datasets, including characteristics and pre-processing notes, that were used in regression experiments.

Figure 23: Rashomon curves for the UCI classification datasets with categorical features. The hierarchy of hypothesis spaces is the set of decision trees from depth one to seven. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments.
Figure 24: Rashomon curves for the UCI classification datasets with real-valued features. The hierarchy of hypothesis spaces is the set of decision trees from depth one to seven. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments.
Figure 25: Rashomon curves for the synthetic datasets with two real-valued features. The hierarchy of hypothesis spaces is the set of decision trees from depth one to seven. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments.
Figure 26: The generalization ability of the Rashomon elbow for the UCI classification datasets with categorical features. The hierarchy of hypothesis spaces are fully grown decision trees from depth one to seven. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments.
Figure 27: The generalization ability of the Rashomon elbow for the UCI classification datasets with real-valued features. The hierarchy of hypothesis spaces are fully grown decision trees from depth one to seven. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments.
Figure 28: The generalization ability of the Rashomon elbow for synthetic datasets with two real-valued features. The hierarchy of hypothesis spaces are fully grown decision trees from depth one to seven. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments.
Dataset Name #features Train set size Test set size Processing notes
Boston 13 404 102
Diabetes 10 353 89
Air quality 11 5552 1389 Dropped features “Date”, “Time”, “NMHC(GT)” and entries with missing values
Concrete 8 824 206
Airfoil 5 1201 301
CASP 9 36584 9146
Sgemm 14 193280 48320 Dropped targets “Run2 (ms)”, “Run3 (ms)”, “Run4 (ms)”
Energy 24 15788 3947 Dropped features “date”, “rv1”, “rv2”, target “lights”
Auto mpg 7 313 79 Dropped car name feature and entries with missing values
Bike Rent 13 584 147 Dropped features “instant”, “dteday”
Parkinson 20 4700 1175 Dropped “motor_UPDRS”
Yacht 6 246 62
CCPP 4 7654 1914
Table 6: Regression datasets description and processing notes.
Figure 29: The generalization ability of the Rashomon elbow for the UCI regression datasets with real features. We consider only 3 features for every dataset with the largest corresponding singular values. The hierarchy of hypothesis spaces consists of polynomials of degree from one to eight. We display only hypothesis classes that have positive Rashomon ratio. The Rashomon parameter θ was set to be 0.05 for all experiments. The regularization parameter for ridge regression was 0.1.

Appendix K Possible ways to compute the Rashomon Elbow

Let us create some simple ways to formalize how we might find the elbow. For a fixed θ, the elbow is the hypothesis space He with the highest Rashomon ratio among all model classes in the hierarchy that can approximately minimize the empirical risk as in Equation 6. The location of the Rashomon elbow can be found be solving an approximate maximization problem, where G(,):2 is a user-defined balance between accuracy and the Rashomon ratio. In particular, the elbow can be defined as a hypothesis space He such that:

HeargmaxHtH1HTG(1-L^(Ht),R^ratio(Ht,θt)). (13)

Here, G would be chosen by the user to represent the ideal balance between accuracy and the Rashomon ratio, and the same G would be used for potentially many different problems for consistency.

As an alternative definition for the Rashomon elbow, we can use a geometric argument, based on intuition provided by Figure 11. Across all hypothesis spaces, the hypothesis space that corresponds to the Rashomon elbow has the largest distance from a line that connects points (L^H0,R^ratio(H0,)) and (L^HT,R^ratio(HT,)). This geometrically-defined Rashomon elbow will generally result in the same maximin point as in Equation (13).