On Explaining Machine Learning Models by Evolving Crucial and Compact Features

  • 2019-07-04 07:52:23
  • Marco Virgolin, Tanja Alderliesten, Peter A. N. Bosman
  • 23

Abstract

Feature construction can substantially improve the accuracy of MachineLearning (ML) algorithms. Genetic Programming (GP) has been proven to beeffective at this task by evolving non-linear combinations of input features.GP additionally has the potential to improve ML explainability since explicitexpressions are evolved. Yet, in most GP works the complexity of evolvedfeatures is not explicitly bound or minimized though this is arguably key forexplainability. In this article, we assess to what extent GP still performsfavorably at feature construction when constructing features that are (1) Ofsmall-enough number, to enable visualization of the behavior of the ML model;(2) Of small-enough size, to enable interpretability of the featuresthemselves; (3) Of sufficient informative power, to retain or even improve theperformance of the ML algorithm. We consider a simple feature constructionscheme using three different GP algorithms, as well as random search, to evolvefeatures for four ML algorithms, including support vector machines and randomforest. Our results on 20 datasets pertaining to classification and regressionproblems show that constructing only two compact features can be sufficient torival the use of the entire original feature set. We further find that a modernGP algorithm, GP-GOMEA, performs best overall. These results, combined withexamples that we provide of readable constructed features and of 2Dvisualizations of ML behavior, lead us to positively conclude that GP-basedfeature construction still works well when explicitly searching for compactfeatures, making it extremely helpful to explain ML models.

 

Quick Read (beta)

On Explaining Machine Learning Models by Evolving Crucial and Compact Features

Marco Virgolin Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, the Netherlands Tanja Alderliesten Department of Radiation Oncology, Amsterdam UMC, Amsterdam 1105 AZ, the Netherlands Peter A.N. Bosman Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam 1098XG, the Netherlands Algorithmics Group, Delft University of Technology, Delft 2628 XE, the Netherlands
Abstract

Feature construction can substantially improve the accuracy of Machine Learning (ML) algorithms. Genetic Programming (GP) has been proven to be effective at this task by evolving non-linear combinations of input features. GP additionally has the potential to improve ML explainability since explicit expressions are evolved. Yet, in most GP works the complexity of evolved features is not explicitly bound or minimized though this is arguably key for explainability. In this article, we assess to what extent GP still performs favorably at feature construction when constructing features that are (1) Of small-enough number, to enable visualization of the behavior of the ML model; (2) Of small-enough size, to enable interpretability of the features themselves; (3) Of sufficient informative power, to retain or even improve the performance of the ML algorithm. We consider a simple feature construction scheme using three different GP algorithms, as well as random search, to evolve features for four ML algorithms, including support vector machines and random forest. Our results on 20 datasets pertaining to classification and regression problems show that constructing only two compact features can be sufficient to rival the use of the entire original feature set. We further find that a modern GP algorithm, GP-GOMEA, performs best overall. These results, combined with examples that we provide of readable constructed features and of 2D visualizations of ML behavior, lead us to positively conclude that GP-based feature construction still works well when explicitly searching for compact features, making it extremely helpful to explain ML models.

keywords:
feature construction, interpretable machine learning, genetic programming, GOMEA
journal: Swarm and Evolutionary Computation\usetikzlibrary

shapes.geometric, arrows \tikzstylesquarered = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=red!10] \tikzstylesquaregreen = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=green!10] \tikzstylesquareblue = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=blue!10] \tikzstylediamondgreen = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=green!10] \tikzstylediamondblue = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=blue!10] \tikzstylediamondred = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=red!10] \tikzstylediamondorange = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=orange!10] \tikzstylecirclewhite = [circle, minimum width=0.5cm, minimum height=0.5cm, text centered, draw=black] \tikzstylecircleyellow = [circle, minimum width=1cm, minimum height=1cm, text centered, draw=black, fill=yellow!10] \tikzstylesquareryellow = [rectangle, rounded corners, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=yellow!10] \tikzstyleinvisible = [rectangle, minimum width=0.5cm, minimum height=0.5cm, text centered] \tikzstylearrow = [thick,-¿,¿=stealth] \tikzstyledarrow = [dashed,-¿,¿=stealth] \tikzstyleline = [thick,-,¿=stealth] \algrenewcommand\alglinenumber[1]#1

1 Introduction

Feature selection and feature construction are two important steps to improve the performance of any Machine Learning (ML) algorithm liu1998feature ; friedman2001elements . Feature selection is the task of excluding features that are redundant or misleading. Feature construction is the task of transforming (parts of) the original feature space into one that the ML algorithm can better exploit.

A very interesting method to perform feature construction automatically is Genetic Programming (GP) koza1992gp ; poli2008field . GP can synthesize functions without many prior assumptions on their form, differently from, e.g., logistic regression or regression splines friedman1991multivariate ; hosmer2013applied . Moreover, feature construction not only depends on the data at hand, but also on the way a specific ML algorithm can model that data. Evolutionary methods in general are highly flexible in their use due to the way they perform search (i.e., derivative free). This makes it possible, for example, to evaluate the quality of a feature for a specific ML algorithm by directly measuring what its impact is on the performance of the ML algorithm (i.e., by training and validating the ML algorithm when using that feature).

Figure 1: Regression surface learned by SVM for the Yacht dataset (in blue), expressed as a 2D function of the two features (on the bottom axes) constructed by our approach. Circles are training samples, diamonds are test samples. The dataset has six features (x(i)). Our approach constructs two new features (using GP-GOMEA, see Sec. 4.1), which are non-linear transformations of the prismatic coefficient (x(2)) and the Froude number (x(6)). With only two features the SVM prediction surface can be visualized. Moreover, these new features are understandable. Finally, the modeling quality is actually improved over employing SVM directly on all six features. The coefficient of determination of SVM increased from 85% using the original features to 98% using the two new features.

Explaining what constructed features mean can shed light on the behavior of ML-inferred models that use such features. Reducing the number of features is also important to improve interpretability. If the original feature space is reduced to few constructed features (e.g., up to two for regression and up to three for classification), the function learned by the ML model can be straightforwardly visualized w.r.t. the new features. In fact, how to make ML models more understandable is a key topic of modern ML research, as many practical, sensitive applications exist, where explaining (part of) the behavior of ML models is essential to trust their use (e.g., in medical applications) lipton2018mythos ; guidotti2018survey ; adadi2018peeking ; goodman2017european . Typically, GP for feature construction searches in a subspace of mathematical expressions. Adding to the appeal and potential of GP, these expressions can be human-interpretable if simple enough guidotti2018survey ; virgolin2019model .

Figure 1 presents an example of the potential held by such an approach: a multi-dimensional dataset transformed into a 2D one, where both the behavior of the ML algorithm and the meaning of the new features is clear, while the performance of the ML algorithm is not compromised w.r.t. the use of the original feature set (it is actually improved).

In this article we study whether GP can be useful to construct a low number of small features, to increase the chance of obtaining interpretable ML models, without compromising their accuracy (compared to using the original feature set). To this end, we design a simple, iterative feature construction scheme, and perform a wide set of experiments: we consider four types of feature construction methods (three GP algorithms and random search), four types of machine learning algorithms. We apply their combinations on 20 datasets between classification and regression to determine to what extent they are capable of effectively and efficiently finding crucial and compact features for specific ML algorithms.

The remainder of this article is organized as follows. Related work is reported in Section 2. The proposed feature construction scheme is presented in Section 3. The search algorithms to construct features, as well as the considered ML algorithms, are presented in Section 4. The experimental setup is described in Section 5. Results related to performance are reported in Section 6, while results concerning interpretability are reported in Section 7. Section 9 discusses our findings, and Section 10 concludes this article.

2 Related Work

Since this article focuses on feature construction, we report here related work that at least includes construction of new features. For contributions that focus exclusively on feature selection, the reader is referred to a recent survey xue2016survey .

One of the first approaches is presented in krawiec2002genetic , where each GP solution is a set of K features. The fitness of a set is the cross-validation performance of a decision tree breiman2017classification using that set. The results on six classification datasets show that the approach is able to synthesize a feature set that is competitive with the original one, and can also be added to the original set for further improvements. No attention is however given to the interpretability of evolved features.

The work in muharram2005evolutionary generates one feature with Standard, tree-based GP (SGP) koza1992gp , to be added to the original set. Feature importance metrics of decision trees such as information gain, Gini index and Chi2 are used as fitness measure. An advantage of using such fitness measures over ML performance is that they can be computed very quickly. However, they are decision tree-specific. Results show that the approach can improve prediction accuracy, and, for a few problems, it is shown that decision trees that are simple enough to be reasonably interpretable, can be found.

Feature construction for high-dimensional datasets is considered in tran2016genetic , for eight bio-medical binary classification problems, with 2,000 to 24,188 features. This approach is different from the typical ones, as the authors propose to use SGP to evolve classifiers rather than features, and extract features from the components (subtrees) of such classifiers. These are then used as new features for an ML algorithm. Results on K-Nearest Neighbors altman1992introduction , Naive Bayes classifier russell2016artificial ; murphy2006naive , and decision tree show that a so-found feature set can be competitive or outperform the original one. The authors show an example where a single interpretable feature is constructed that enables linear separation of the classification examples.

Different from the aforementioned works, chen2017genetic explores feature construction for regression. A SGP-based approach is designed to tackle regression problems with a large number of features, and is tested on six datasets. Instead of using the constructed features for a different ML algorithm, SGP dynamically incorporates them within an ongoing run, to enrich the terminal set. Every α generations of SGP, the subtrees composing the best solutions become new features by encapsulation into new terminal nodes. The approach is found to improve the ability of SGP to find accurate solutions. However, the features found by encapsulating subtrees are not interpretable because allowing subsequent encapsulations leads to an exponential growth of solution size.

A recent work that focuses on evolutionary dimensionality reduction and consequent visualization is cano2017multiobjective , where a multi-objective, grammar-based SGP approach is employed. K feature transformations are evolved in synergy to enable, at the same time, good classification accuracy, and visualization through dimensionality reduction. The system is thoroughly tested on 42 classification tasks, showing that the algorithm performs well compared to state-of-the-art dimensionality reduction methods, and it enables visualization of the learned space. However, as trees are free to grow up to a height of 50, the constructed features themselves cannot be interpreted.

The most similar works to ours that we found are virgolin2018gecco and tran2019genetic . In virgolin2018gecco , which is our previous work, the possibility of using a modern model-based GP algorithm (which we also use in our comparisons) for feature construction is explored on four regression datasets. There, focus is put on keeping feature size small, to actively attempt to obtain readable features. These features are iteratively constructed to be added to the original feature set to improve the performance of the ML algorithm, and three ML algorithms are compared (linear regression, support vector machines cortes1995support , random forest breiman2001random ). Reducing the feature space to enable a better understanding of inferred ML models is not considered.

In tran2019genetic , different feature construction approaches are compared on gene-expression datasets that have a large number of features (thousands to tens of thousands) to study if evolving class-dependent features, i.e., features that are each targeted at aiding the ML algorithm detect one specific class, can be beneficial. Similarly to us, the authors show visualizations of feature space reduced to up to three constructed features, and an example of three features that are encoded as very small, easy-to-interpret trees. However, such small features are a rare outcome as the trees used to encode features typically had more than 75 nodes. These trees are therefore arguably extremely hard to read and interpret.

Our work is different from previous research in two major aspects. First, none of the previous work principally addresses the conflicting objectives of retaining good performance of an ML algorithm while attempting to explain both its behavior (by dimensionality reduction to allow visualization), and the meaning of the features themselves (by constraining feature complexity). Second, multiple GP algorithms within a same feature construction scheme, on multiple ML algorithms, are not compared in previous work. Most of the times, it is a different feature construction scheme that is tested, using arguably small variations of SGP. Here, we consider random search, two versions of SGP, as well as another modern GP algorithm. Furthermore, we adopt both “weak” ML algorithms such as ordinary least squares linear regression and the naive Bayes classifier, as well as “strong”, state-of-the-art ones, which are rarely used in literature for feature construction, such as support vector machine and random forest; on both classification and regression tasks.

{tikzpicture}

[node distance=4cm]

\node

(Training) [squaregreen] Tr iteration k-1 x(1) x(k-1) y 22.49 -3.10 10.4 12.98 -7.41 7.49 ;

\node

(GP) [diamondorange, right of=Training, xshift=1cm, yshift=2.0cm] GP;

\node

(eTraining) [squaregreen, right of=GP, xshift=1.25cm] Tr iteration k x(k-1) x(k) y -3.10 7.12 10.4 -7.41 9.41 7.49 ;

\node

(eTest) [squareblue, below of=eTraining, yshift=0.25cm] Te iteration k x(k-1) x(k) y 9.87 1.11 5.55 6.45 4.78 12.01 ;

\node

(eFeatureContainer) [invisible, below of=GP, xshift=-0.0cm, yshift=0cm, minimum height=4cm,minimum width=2.0cm];

\node

(eFeatureN1) [circlewhite, below of=GP, yshift=1.75cm] ;

\node

(eFeatureN2) [circlewhite, below of=GP, xshift=-0.5cm, yshift=1.0cm] ;

\node

(eFeatureN3) [circlewhite, below of=GP, xshift=0.5cm, yshift=1.0cm] ;

\node

(eFeatureN4) [circlewhite, below of=GP, xshift=-1cm, yshift=0.25cm] ;

\node

(eFeatureN5) [circlewhite, below of=GP, xshift=0cm, yshift=0.25cm] ;

\node

(eFeatureCap) [invisible, below of=GP, yshift=-0.5cm] Best New Feature x(k) ;

\node

(ML algorithm) [diamondorange, right of=eTraining, xshift=1.5cm] ML alg. ;

\node

(Model) [squarered, below of=ML algorithm, yshift=1cm] Trained Model ;

\node

(Prediction) [invisible, below of=Model, yshift=1.5cm] Test Error k ;

\draw

[darrow] (Training) —- (GP); \draw[arrow] (GP) – (eFeatureContainer); \draw[arrow] (eFeatureContainer) ++(25pt,25pt) -— ++(20pt,40pt) —- ++(46pt,25pt); \draw[arrow] (eFeatureContainer) ++(25pt,25pt) -— ++(20pt,40pt) —- (eTest); \draw[line] (eFeatureN1) – (eFeatureN2); \draw[line] (eFeatureN1) – (eFeatureN3); \draw[line] (eFeatureN2) – (eFeatureN4); \draw[line] (eFeatureN2) – (eFeatureN5);

\draw

[darrow] (eTraining) – (ML algorithm); \draw[arrow] (ML algorithm) – (Model); \draw[darrow] (eTest) ++(63pt,21pt) – (Model); \draw[arrow] (Model) – (Prediction);

Figure 2: Construction of the k-th feature and computation of the k-th test error. Evolved features use the features of the original dataset (not shown) and random constants as terminal nodes. Dashed arrows represent inputs, solid arrows represents outputs.

3 Iterative Evolutionary Feature Construction

We use a remarkably simple scheme to construct features. Our approach constructs K+ features by iterating K GP runs. The evolution of the k-th feature (k{1,,K}) uses the previously constructed k-1 features.

3.1 Feature Construction Scheme

The dataset D defining the problem at hand is split into two parts: the training Tr and the test Te set. This partition is kept fixed through the whole procedure. Only Tr is used to construct features, while Te is exclusively used for final evaluation to avoid positive bias in the results kohavi1997wrappers . We use the notation xj(i) to refer to the i-th feature value of the j-th example, and yj for the desired outcome (label for classification or target value for regression) of the j-th example.

The k-th GP run evolves the k-th feature. An example is shown in Figure 2. Each solution in the population competes to become the new feature x(k), that represents a transformation of the original feature set. In every run, the population is initialized at random.

We evaluate the fitness of a feature of the k-th run by measuring the performance of the ML algorithm on a dataset that contains that feature and the previously evolved k-1 features.

We only use original features (and random constants) as terminals. In particular, the features constructed by previous iterations are not used as terminal nodes in the k-th run. This prevents the generation of nested features, which could harm interpretability.

At the end of the k-th run, the best feature is stored and its values xj(k) are added to Tr and Te for the next iterations.

3.2 Feature Fitness

The fitness of a feature is computed by measuring the performance (i.e., error) of the ML algorithm when the new feature is added to Tr. We consider the C-fold cross-validation error rather than the training error to promote generalization and prevent overfitting. The pseudo code of the evaluation function is shown in Algorithm 1.

Specifically, the C-fold cross-validation error is computed by partitioning Tr into C splits. For each c=1,,C iteration, a different split is used for validation (set Vc), and the remaining C-1 splits are used for training (set Trc). The mean validation error is the final result.

For classification tasks, in order to take into account both multiple and possibly imbalanced class distributions, the prediction error is computed as 1 minus the macro F1 score, i.e., 1 minus the mean of the class-specific F1 scores:

1-F1 =1-1#𝑐𝑙𝑎𝑠𝑠𝑒𝑠γ𝑐𝑙𝑎𝑠𝑠𝑒𝑠F1γ
=1-2#𝑐𝑙𝑎𝑠𝑠𝑒𝑠γ𝑐𝑙𝑎𝑠𝑠𝑒𝑠TPγTPγ+FPγTPγTPγ+FNγTPγTPγ+FPγ+TPγTPγ+FNγ,

where TPγ,FNγ,FPγ are the true positive, false negative, and false positive classifications for the class γ, respectively. If the computation of F1γ results in 00, we set F1γ=0.

For regression, the prediction error is computed with the Mean Squared Error (MSE).

Algorithm 1 Computation of the fitness of a feature s
{algorithmic}

[1] \FunctionComputeFeatureFitnesss \StateTrAddFeatureToCurrentTrainingSet(s) \Stateerror0 \Forc=1,,C \StateTc,VcSplitSet(c,C,Tr) \StateMTrainMLModel(Tc) \Stateerrorerror+ComputeError(M,Vc) \EndFor\StateReturn(errorC) \EndFunction

3.3 Preventing Unnecessary Fitness Computations

Computing the fitness of a feature is particularly expensive, as it consists of a C-fold cross-validation of the ML algorithm. This limits the feasibility of, e.g., adopting large population sizes and large numbers of evaluations for the GP algorithms.

We therefore attempt to prevent unnecessary cross-validation calls, by assessing if features meet four criteria. Let n be the number of examples in Tr. The criteria are the following:

  1. 1.

    The feature is not a constant. We avoid evaluating constant features as they are likely to be useless for many ML algorithms, which internally already compute an intercept.

  2. 2.

    The feature does not contain extreme values that may cause numerical errors, i.e., with absolute value above a lower-bound β or above an upper-bound βu. Here, we set β=10-10, and βu=1010 (none of the datasets considered here have values exceeding these bounds).

  3. 3.

    The feature is not equivalent to one constructed in the previous k-1 iterations. Equivalence is determined by checking the values available in Tr, i.e., equivalence holds if:

    jTr,i{1,,k-1}:xj(k)=xj(i).

    Note that a constructed feature that is equivalent to a feature of the original feature set can be valid, as long as no other previously constructed feature exists that is already equivalent. Thus, our approach can in principle perform pure feature selection.

  4. 4.

    The values (in Tr) of the feature in consideration have changed since the last time the feature was evaluated. This is because GP variation is not guaranteed to change the output of solutions (e.g., a multiplication by 1 is inserted). This is realized by caching the latest feature values after evaluation.

The computational effort for each criterion is O(n) (it is O((k-1)n) for criterion 3, however in our experiments kn). The fitness of a feature failing criterion 1, 2, or 3 is set to the maximum possible error value. If criterion 4 fails, the fitness remains the same (although performing cross-validation may lead to slightly different results when using stochastic ML algorithms like random forest).

4 Considered Search Algorithms and Machine Learning Algorithms

We consider SGP, Random Search (RS), and the GP instance of the Gene-pool Optimal Mixing Evolutionary Algorithm (GP-GOMEA) as competing search algorithms to construct features. SGP is widely used in feature construction (see related work in Sec. 2). RS is not typically considered, yet we believe it is important to assess whether evolution does bring any benefit over random enumeration within the confines of our study, i.e., when forcing to find small features. GP-GOMEA is a recently introduced GP algorithm that has proven to be particularly proficient in evolving accurate solutions of limited size virgolin2019model ; virgolin2017scalable ; virgolin2018gecco .

As ML algorithms, we consider the Naive Bayes Classifier (NBC), ordinary least-squares Linear Regression (LR), Support Vector Machines (SVM), and Random Forest (RF). NBC is used only for classification tasks, LR only for regression tasks, SVM and RF for both tasks. We provide more details in the following sections.

4.1 Details on the Search Algorithms

All search algorithms use the fitness evaluation function. A feature s is evaluated by first checking whether the four criteria of Section 3.3 are met, and then, if the outcome is positive, by running the ML algorithm over the feature-extended dataset.

For SGP, we use subtree crossover and subtree mutation, picking the depth of subtree roots uniformly randomly as proposed in pawlak2015semantic . The candidate parents for variation are chosen with tournament selection. Since we are interested in constructing small features so as to increase the chances they will be interpretable, we consider two versions of SGP. The first is the classic one where solutions are free to grow to tree heights typically much larger than the one used for tree initialization. In the following, the notation SGP refers to this first version. The second one uses trees that are not allowed to grow past the initial maximum tree height. We call this version bounded SGP, and use the notation SGPb.

RS is realized by continuously sampling and evaluating new trees, keeping the best koza1992gp . Like for SGPb, a maximum tree height is fixed during the whole run. If evolution is hypothetically no better than RS, then we expect that SGPb and GP-GOMEA will construct features that are no better than the ones constructed by RS.

GP-GOMEA is a recently introduced GP algorithm that has been found to deliver accurate solutions of small size on benchmark problems virgolin2017scalable , and to work well when a small size is enforced in symbolic regression virgolin2019model ; virgolin2018gecco . GP-GOMEA uses a tree template fixed by a maximum tree height (which can include intron nodes to allow for unbalanced tree shapes) and performs homologous variation, i.e., mixed tree nodes come from the same positions in the tree. Each generation prior to mixing, a hierarchical model that captures interdependencies (linkage) between nodes is built (using mutual information). This model, called Linkage Tree (LT), drives variation by indicating what nodes should be changed en block during mixing, to avoid the disruption of patterns with large linkage.

The LT has been shown to enable GP-GOMEA to outperform subtree crossover and subtree mutation of SGP, as well as the use of a randomly-build LT, i.e., the Random Tree (RT), on problems of different nature virgolin2019model ; virgolin2017scalable . However, the LT requires sufficiently large population sizes to be accurate and beneficial (e.g., several thousand solutions in GP for symbolic regression) virgolin2019model . Because in the framework of this article fitness evaluations use the cross-validation of a ML algorithm, we cannot afford to use large population sizes. Accordingly, we found the adoption of the LT to not be superior to the adoption of the RT under these circumstances in preliminary experiments. Therefore, for the most part, we adopt GP-GOMEA with the RT (GP-GOMEART). This means we effectively compare random hierarchical homologous variation with subtree-based variation. An example of adopting the LT and large population sizes for feature construction is provided in Section 9.

4.2 Details on the ML algorithms

We now briefly describe the ML algorithms used in this work: NBC, LR, SVM, and RF. NBC and LR are less computationally expensive compared to SVM and RF. In the following, k is the number of features and n is the number of examples in the training set Tr.

NBC is a classifier which assumes independence between features russell2016artificial ; murphy2006naive . NBC is often used as a baseline, as it is simple and fast to train. Training an NBC has computational complexity of O(nk), while classifying a new example takes O(k×#𝑐𝑙𝑎𝑠𝑠𝑒𝑠). We use the mlpack implementation of NBC mlpack2013 and assume the data to be normally distributed (default setting).

Similarly to NBC, LR is often used as a baseline as it is simple and fast, for regression tasks. LR assumes that the target variable can be explained by a linear combination of the features russell2016artificial . The training computational complexity of LR is O(nk2). Predicting the target variable for a new example requires O(k). We use the mlpack implementation of LR mlpack2013 .

SVM is a powerful ML algorithm that can be used for classification and regression cortes1995support ; CC01a . A fundamental part of SVM is the kernel trick, which allows to project the features in a different space. We use the libsvm C++ implementation CC01a , which has a training complexity between O(n2k) and O(n3k) depending on the specific training set. Prediction complexity is O(vk), with v the number of support vectors, which is parameter dependent. We consider the Radial Basis Function (RBF) kernel, which works well in practice for many problems, with C-SVM for classification, and -SVM for regression.

RF is an ensemble ML algorithm which, like SVM, can be used for both classification and regression breiman2001random . RF works by building an ensemble of decision trees, each trained on a sample of the training set (bagging). At prediction time, the mean (or maximum agreement) prediction of the decision trees is returned. Assuming decision trees can grow up to height O(logn), the computation time for training is O(τμnlogn), where τ is the number of decision trees and μ is the number of features considered for splitting nodes at decision tree construction, i.e., the mtry parameter (typically μ=k for classification and μ=k/3 for regression). The prediction takes O(τlogn). We use the ranger C++ implementation wright2015ranger .

Table 1: Parameter Settings of the GP Algorithms.
SGP(b) GP-GOMEART
Population size 100 100
Initialization method Ramped Half and Half Half and Half
Initialization max tree height 26 (2 or 4) 2 or 4
Max tree height 17 (2 or 4) 2 or 4
Variation SX 0.9, SM 0.1 parameter-less
Selection \makecellTournament 7, Elitism 1 parameter-less
Function set {+,×,-,÷,2,,logp,exp} for all
Terminal set {x(i),𝙴𝚁𝙲} for all

5 Experiments

We perform 30 runs of our Feature Construction Scheme (FCS), with SGP, SGPb, RS, and GP-GOMEART, in combination with each ML algorithm (NB only for classification and LR only for regression), on each problem. Each run of the FCS uses a random train-test split of 80%-20%, and considers up to K=5 features construction rounds. We use a population size of 100 for the search algorithms, and assign a maximum budget of 10,000 function evaluations to each FCS iteration. This results in relatively large running times for complex ML algorithms (see Sec. 8). An experiment including larger evolutionary budgets and the use of the LT in GP-GOMEA is presented in the discussion (Sec. 9). We use a limit on the total number of evaluations instead of a a limit on the total number of generations because GP-GOMEART performs more evaluations than SGP per generation virgolin2017scalable .

For GP-GOMEART, SGPb, and RS, we consider two levels of maximum tree height h: 2 and 4. This choice yields a maximum solution size of 7 and 31 respectively (using function nodes with a maximum arity r=2). We choose these two height levels because we found features with h=2 to be arguably easy to read and interpret, whereas features with h=4 can already be very hard to understand. This indication is also reported in virgolin2019model for the evolution of symbolic regression formulas. Note that using a tree height limit over a solution size limit prevents finding deep trees containing the nesting of the arguably more complicated to understand non-linear functions 2,,logp,exp. We do not consider bigger tree heights as resulting features may likely be impossible to interpret, defying a key focus of this work.

Other parameter settings used for the GP algorithms are shown in Table 1. SGPb uses the same settings as SGP, except for the maximum tree height (at initialization and along the whole run), which is set to the same of GP-GOMEART. In GP-GOMEART we use the Half and Half (HH) tree initialization method instead of the Ramped Half and Half (RHH) koza1992gp commonly used for SGP. This proved to be beneficial since GOM varies nodes instead of subtrees virgolin2019model ; virgolin2018gecco . For both HH and RHH, syntactical uniqueness of solutions is enforced for up to 100 tries koza1992gp . In GP-GOMEART we additionally avoid sampling trees having a terminal node as root by setting the minimum tree height of the grow method to 1. This is not done for SGP and SGPb, because differently from GP-GOMEART where homologous nodes are varied, subtree root nodes for subtree crossover (SX) and subtree mutation (SM) are chosen uniformly randomly. RS samples new trees using the same initialization method as SGPb, i.e., RHH.

The division operator ÷ used in the function set is the analytic quotient operator (a÷b=a/1+b2), which was shown to lead to better generalization performance than protected division ni2013use . The logarithm is protected logp()=log(||) and logp(0)=0, and so is the square root operator. The terminal set contains the original feature set, and an Ephemeral Random Constant (ERC) poli2008field with values uniformly sampled between the minimum and maximum values of the features in the original training set, i.e., [minxj(i),maxxj(i)],i,jTr.

The hyperparameter settings for the SVM and RF are shown in Table 2, and are mostly default breiman2001random ; CC01a ; wright2015ranger . NBC and LR implementations do not have hyperparameters.

We consider 10 classification and 10 regression benchmark datasets11 1 The datasets are available at http://goo.gl/9D2z3b. Details on the datasets are reported in Table 3. Rows with missing values are omitted. Most datasets are taken from the UCI Machine Learning repository22 2 http://archive.ics.uci.edu/ml/, with exception for Dow Chemical and Tower, which come from GP literature white2013better ; albinati2015effect .

Table 2: Hyperparameter Settings of SVM and RF.
SVM
Kernel RBF
Cost 1
Epsilon 0.1
Tolerance 0.001
Gamma 1k
Shrinking Active
RF
Number of trees 100
Bagging sampling with replacement
Classification mtry #features
Regression mtry min(1,#features3)
Min node size 1 classification, 5 regression
Split rule Gini classification, Variance regression
Table 3: Classification and Regression Datasets.
Dataset # Features # Examples # Classes

Classification

Cylinder Bands 39 277 2
Breast Cancer Wisc. 29 569 2
Ecoli 7 336 8
Ionosphere 34 351 2
Iris 4 150 3
Madelon 500 2600 2
Image Segmentation 19 2310 7
Sonar 60 208 2
Vowel 9 990 11
Yeast 8 1484 10

Regression

Airfoil 6 1503
Boston Housing 13 506
Concrete 9 1030
Dow Chemical 57 1066
Energy Cooling 9 768
Energy Heating 9 768
Tower 26 4999
Wine Red 12 1599
Wine White 12 4898
Yacht 7 308

6 Results: performance

The results described in this section aim at assessing whether it is possible to construct few and small features that lead to an equal or better performance than the original set, and whether some search algorithms can construct better features than others.

6.1 General Performance of Feature Construction

We begin by observing the dataset-wise aggregated performance of FCS for the different GP algorithms and the different ML algorithms, separately for classification and regression.

NB SVM RF

Training

Test

Training

Test

Training

Test

h=2

h=4

Figure 3: Aggregated results on the classification datasets. Horizontal axis: Number of features. Vertical axis: Average of median F1 score obtained on 30 runs for each dataset.
LR SVM RF

Training

Test

Training

Test

Training

Test

h=2

h=4

Figure 4: Aggregated results on the regression datasets. Horizontal axis: Number of features. Vertical axis: Average of median R2 score obtained on 30 runs for each dataset.

6.1.1 Classification

Figure 3 shows dataset-wise aggregated results obtained for NB, SVM, and RF, for classification tasks. Each data point is the mean among the dataset-specific medians of macro F1 from the 30 runs.

In general, the use of only one constructed feature does not perform as good as the use of the original feature set (with exception for NB). Constructing more features improves the performance, but with diminishing returns.

Specifically for NB, the use of a single constructed feature is already preferable to the use of the original feature set (unless RS is used as constructor). This is likely due to the fact that NB assumes complete independence between the provided features, and this can be implicitly tackled by FCS. SGP (unbounded) is the best performing algorithm as it can evolve arbitrarily complex features, however, the magnitude of improvement of the macro F1 score w.r.t. GP-GOMEART and SGPb is limited. For h=4 and K=5, GP-GOMEART reaches the performance of SGP. GP-GOMEART is typically slightly better than SGPb, and RS has worse performance.

The performance of FCS for SVM has an almost identical pattern to the one observed for NB, except for the fact that the performance is found to be consistently better. However, for SVM it is preferable to use the original feature set rather than few constructed features. This is evident in terms of training performance, but less at test time. In fact, using only 5 constructed features leads to similar test performance compared to using the original set. The GP algorithms compare to each other similarly to when using NB.

The way performance improves for RF by constructing features is similar to the one observed for NB and SVM. However, for RF the differences between the search algorithms is particularly small: notice that using RS leads to close performance to the ones obtained by using the other GP algorithms, compared to the NB and the SVM case. Moreover, virtually no difference can be seen between GP-GOMEART and SGPb. This suggests that RF already works well with less refined features. Now, the features constructed by SGP are no longer the best performing at test time. This is likely because SGP evolves larger, more complex features than the other algorithms (see Sec. 6.1.3), making RF overfit.

As to maximum tree height, allowing the constructed features to be bigger (h=4 vs h=2) moderately improves the performance. Interestingly, GP-GOMEART with h=4 reaches competitive performance with SGP on all ML algorithms, despite the latter having no strict limitation on feature size.

6.1.2 Regression

Results on the regression tasks are shown in Figure 4, dataset-wise aggregated for LR, SVM, and RF. We report the results in terms of coefficient of determination, i.e., R2(y,y¯)=1-MSE(y,y¯)/var(y). For the three ML algorithms, results overall follow the same pattern. SGP is typically better, although constructing more features reduces the performance gap with the other GP algorithms. GP-GOMEART is slightly, yet consistently, the best performing within the maximum tree height limitation of 2, while SGPb is visibly preferable only when a single feature is constructed for LR and SVM, for h=4. Differently from the classification case, two features are typically enough to reach the performance of the original feature set for all ML algorithms. This is possibly due to the particular function set used in these experiments.

As for classification, allowing for larger trees results in better performance overall, and reduces the gap between SGP and the other GP algorithms.

6.1.3 Feature size

Figure 5 shows the aggregated feature size for the different GP algorithms and RS. The aggregated solution size is computed by taking the median solution size per run, then averaging over datasets, and finally averaging over ML algorithms (classification and regression are considered together). The picture shows how, overall, the known SGP tendency to bloat differs compared to the algorithms working with a strict tree height limitation. SGP features are so large that it is nearly impossible to interpret them (see Sec. 7.1).

RS finds the smallest features for both height limits h=2 and h=4. Considering that GP-GOMEART and SGPb generate trees within the same height bounds of RS, we conclude that it is the variation operators that allow finding larger trees with improved fitness within the height limit. GP-GOMEART seems to construct slightly, yet consistently, larger trees than SGPb.

For SGP, it can be seen that subsequently constructed features are smaller (this is barely visible for GP-GOMEART and SGPb as well). This is interesting because we do not use any mechanism to promote smaller trees. This result is likely linked to the diminishing returns in performance observed in Figure 3 and 4: constructing new complex and informative features becomes harder with the number of FCS iterations.

Feature size

Figure 5: Aggregated feature size for k=1,,5. Solid (dotted) lines represent solution size for maximum tree height h=2 (h=4). Shaded areas represent standard deviation. SGP is free to grow solutions up to h=17.

6.2 Statistical Significance: Comparing GP Algorithms

The aggregated results of Section 6.1 show moderate differences between GP-GOMEART and SGPb. These are arguably the most interesting algorithms to compare in-depth, as they are able to construct small features that lead to good performance (RS typically constructs less informative features, while SGP constructs very large ones).

We perform statistical significance tests to compare GP-GOMEART and SGPb. We consider their median performance on the test set Te, obtained by the FCS, and also compare it with the use of the original feature set, for each ML algorithm and each dataset. In our case, the subjects of our significance tests are the two search algorithms (i.e., GP-GOMEART and SGPb) and the original feature set, while the treatments are the configurations given by pairing ML algorithms and datasets demvsar2006statistical .

We first perform a Friedman test, and then pairwise Wilcoxon signed rank tests, paired by treatment (ML algorithm-dataset combination) demvsar2006statistical . We adopt the Holm correction method to account for false positive results due to chance holm1979simple .

We consider both h=2 and h=4, and focus on K=2, since consideration of only two constructed features makes interpretation easier, and allows human visualization (see Sec. 7.1).

6.2.1 Classification

For both h=2,4, the Friedman tests strongly indicates differences between GP-GOMEART, SGPb, and the use of the original feature set (p-value0.05).

Figure 6 (top) shows the Holm-corrected p-values obtained by the pairwise Wilcoxon tests for classification, where the alternative hypothesis is that the row allows for larger macro F1 scores than the column. No significant differences between GP-GOMEART and SGPb are found for both h=2,4, although both lead to generally better performance than use of the original feature set. The latter result appears to be in contrast with the results from Fig. 3 for SVM and RF, where it can be seen that the construction of only two features does, on average, lead to slightly worse results than using the original feature set. Nonetheless, the opposite is true for NB, and with larger magnitude. Considering these three ML algorithms together on all 10 datasets, there are more cases where using two evolved features is similarly good or better than using the original feature set, ultimately leading to the positive outcome of the statistical significance test.

The p-value for GP-GOMEART is barely larger than 0.05 for h=2 (due to Holm’s correction), while the one of SGPb is slightly smaller. SGPb seems preferable in this case. However, we found that GP-GOMEART’s training performance is significantly better than the one of SGPb (corrected-p-value0.01). This means that GP-GOMEART actually constructed better features w.r.t. the training set, which resulted in a slight overfitting at test time.

For h=4, again no significant difference is found between GP-GOMEART and of SGPb for FCS. Both search algorithms enable significantly better performance than the use of the original feature set, with p-values very close to 0.05.

6.2.2 Regression

As for classification datasets, the Friedman tests indicates that differences are presents between the subjects. Figure 6 (bottom) shows the Holm-corrected p-values obtained by the pairwise Wilcoxon tests for regression.

The statistical tests confirm the hypothesis that the algorithms are capable of providing constructed features that are more informative than the original feature set, as observed in Fig. 4 for the regression datasets. Now, GP-GOMEART is significantly better than SGPb when h=2. As found for classification, GP-GOMEART also here leads to better performance than SGPb on the training set (corrected-p-value0.01), however, this time this is reflected at the test time, meaning than no relevant overfitting is happening. For h=4, GP-GOMEART is not found to be significantly better than SGPb.

Classification

Regression

h=2

h=4

Figure 6: Holm-corrected p-values of pairwise Wilcoxon tests on test performance. Rows are tested to be significantly better than columns. Orig stands for the original feature set.

6.3 Statistical Significance: Two Constructed Features vs. the Original Feature Set per ML Algorithm

Results presented in Sec. 6.1 indicate that our FCS brings most benefit if used with the weak ML algorithms. We now report, for each ML algorithm, on how many datasets 2 features constructed using GP-GOMEA (with h=2 and h=4) lead to statistically significantly (using Holm-corrected pairwise Wilcoxon test) better, equal, or worse results compared to using the original feature set on the test set. This is shown in Table 4.

These results confirm what seen in Figures 3 and 4. Using FCS typically outperforms the use of the original feature set for the weak ML algorithms. For the strong ML algorithms, in most cases, using the original feature set is preferable. However, for some datasets reducing the space to two compact features without compromising performance is still possible.

The use of the original feature set is hardest to beat for RF. In the regression case with h=4, FCS brings benefits on the datasets Airfoil, Energy Cooling, Energy Heating, and Yacht; and performs on par with the use of the original feature set on the datasets Boston Housing and Concrete. These datasets are the ones with the smallest number of original features. We find similar results for SVM. It is reasonable to expect that FCS works well when few features can be combined.

In the classification case, findings are different. For RF and h=4, the datasets where using two constructed features bring similar or better results than using the original feature set are Breast Cancer Wisconsin and Iris. The latter does have a small number of original features (4), but the former has more than several other datasets (29). Furthermore, the datasets where FCS helps are different for SVM: FCS performs equally good to the original feature set on Iris and Cylinder Bands (39 features), and better on Madelon (500 features) and Image Segmentation (19 features). For classification datasets, we cannot conclude that a small cardinality of the original feature set is a good indication feature construction will work well. Furthermore, feature construction influences different ML algorithms in different ways.

Table 4: Number of Datasets Where Using 2 Features Constructed with GP-GOMEA Results in Significantly Better/Equal/Worse Performance Compared to Using the Original Feature Set on the Test Set.
h NB SVM RF

Class.

2 8/1/1 2/2/6 1/1/8
4 8/1/1 2/2/6 1/1/8
h LR SVM RF

Regr.

2 5/3/2 5/2/3 4/0/6
4 7/1/2 5/2/3 4/2/4

7 Results: improving interpretability

The results presented in Sec. 6 showed that the original feature set can be already outperformed by two small constructed features in many cases. We now aim at assessing whether constraining features size can enable interpretability of the features themselves, as well as if extra insight can be achieved by plotting and visualizing the behavior of a trained ML model in the new two-dimensional space.

7.1 Interpretability of Small Features

Table 5 shows some examples of features constructed by GP-GOMEART, for h=2 and h=4. We report the first feature constructed for the K=2 case, with median test performance. We show the first feature as it is typically not smaller than the second (see Fig. 5). Analytic quotients and protected logarithms are replaced by their respective definitions. We remark that we do not check whether the meaning of the features is sound (e.g., ensuring a certain unit of measure is returned). Constraining feature meaning is problem-dependent, and outside the scope of this work.

For classification, we choose NB as it is the method which benefits most from feature construction. The dataset considered is Cylinder Bands, where NB achieves the largest median test improvement: from F1=0.27 with the original set, to F1=0.60 with K=2, for both h=2 and h=4.

For regression, we consider LR on the Concrete dataset, for the same aforementioned reasons. The test R2 obtained with the original feature set is 0.59, the one with two features constructed by GP-GOMEART is 0.76 (0.78) for h=2 (h=4).

For h=2, we argue that constructed features are mostly easy to interpret. For example, the feature shown for LR on Concrete tells us that aging (x(8)) has a negative impact on concrete compressive strength, whereas using more water (x(4)) than cement (x(1)) has a positive effect (both features are in kg/cm3). The impact of other features is less important (within the data variability of the dataset). For h=4, some features can be harder to read and understand, however many are still accessible. This is mostly because, even though the total solution size reachable with h=4 is 31, constructed features are typically half the size (see Fig. 5).

Overall, we cannot draw a strict conclusion on whether the features found by our approach are interpretable, as interpretability is a subjective matter and no clear-cut metric exists lipton2018mythos ; guidotti2018survey . Yet, it appears evident that enforcing a restriction on their size is a necessary condition. We generally find that features using 15 or more nodes start to be hard to interpret w.r.t. our experimental settings, i.e., using our function set.Lastly, features constructed without a strict size limitation (by SGP) are in general very large, and thus extremely hard to understand. As an example, Figure 7 shows the first of the two features constructed by SGP for NB on Cylinder Bands (this is smaller than the first feature found by SGP for LR on Concrete).

((((logp(((x(31))2-logp((x(24)÷((logp(logp(x(3)+x(31)))×((x(3)×x(18))2×
(((x(24)×x(2))+x(31))×x(3)+logp(x(3)+x(34))÷((x(3)×x(18)+x(14)+
x(34))÷x(14)))))+(x(24)÷(((((x(6)÷(x(5))2)÷x(3))+x(3))×x(8))+
x(31))))))))×x(5)+(x(3)×x(18)+x(31))÷x(31)))2×x(9)))12
Figure 7: Example of a relatively “small” feature constructed by SGP, consisting of approximately 70 symbols (tree nodes). Note that the analytic quotient operator ÷ has not been expanded to its definition. This feature is arguably very hard to interpret.
Table 5: Examples of Features Constructed by GP-GOMEART with h{2,4}, K=2, for NB on Cylinder Bands, and for LR on Concrete.
h 1st Feature

NB

2 (x(27)+x(29))/1+(x(9)x(13))2
4 x(5)x(10)x(13)-log|x(9)|

LR

2 x(4)-x(1)+932.204/1+(x(8))2
4 19.764log|x(8)|+x(2)+2x(1)/1+(x(4))2

7.2 Visualizing What the ML Algorithm Learns

The construction of a small number of interpretable features can enable a better understanding of the problem and of the learned ML models. The case where up to two features are constructed is particularly interesting, since it allows visualization.

We provide one example of classification boundaries and one of a regressed surface, inferred by SVM on a two dimensional feature space obtained with our approach using GP-GOMEART.

The classification dataset on which we find the best test improvement for h=4 is Image Segmentation, where the F1 score of SVM reaches 0.88, against 0.65 using the original feature set (median run). Figure 8 shows the classification boundaries learned by SVM. The analytic quotient operator ÷ and the protected log logp are replaced by their definition for readability. The constructed features are rather complex here, yet readable. At the same time, it can be clearly seen how the training and test examples are distributed in the 2D space, and what classification boundaries SVM learned.

For regression, Figure 1 shows the surface learned by SVM on Yacht (median run), where GP-GOMEART with h=2 constructs two features that lead to an R2 of 0.98, against 0.85 obtained using the original feature set. The features are arguably easy to interpret, while it can be seen that the learned surface accurately models most of the data points.

Figure 8: Classification boundaries learned by SVM with two features constructed by GP-GOMEART (h=4) on the Image Segmentation dataset. The run with median test performance is shown. Circles are training samples, diamonds are test samples.
Table 6: Mean Serial Running Time to Construct Five Features Using GP-GOMEART with Maximum Tree Height 4 on the Smallest and Largest Datasets.
Dataset Size NB/LR SVM RF

Clas.

Iris 150×4 7 s 2 m 25 m
Madelon 2600×500 4 m 14 h 8 h

Regr.

Yacht 308×7 8 s 4 m 1 h
Tower 4999×26 2 m 34 h 34 h

8 Running Time

Our results are made possible by evaluating the fitness of constructed features with cross-validation, a procedure which is particularly expensive. Table 6 shows the (mean over 30 runs) serial running time to construct five features on the smallest and largest classification and regression datasets, using GP-GOMEART with h=4 and the parameter settings of Sec. 5, on the relatively old AMD Opteron™ Processor 6386 SE33 3 http://cpuboss.com/cpu/AMD-Opteron-6386-SE. Running time has a large variability, from seconds to dozens of hours, depending on dataset size and ML algorithm. For the datasets and ML algorithms we considered, it can be argued that our approach can be used in practice. However, for datasets with many more features and examples (e.g., image recognition datasets such as CIFAR krizhevsky2009learning ), and much more time-intensive ML algorithms (e.g., GP itself, or deep neural networks lecun2015deep ), to use FCS may not be feasible.

As to memory occupation, it basically mostly depends on the way the chosen ML algorithm handles the dataset. Our runs required at most few hundreds of MBs when dealing with the larger datasets, for SVM and RF.

9 Discussion

We believe this is one of the few works on evolutionary feature construction where the focus is put on both improving the performance of an ML algorithm, and on human interpretability at the same time. The interpretability we aimed for is twofold: understanding the meaning of the features themselves, as well as reducing their number. GP algorithms are key, as they can provide features as interpretable expressions given basic functional components, and a complexity limit (e.g., tree height).

We have run a large set of experiments, totaling more than 100,000 cpu-hours. Our results strongly support the hypothesis that the original feature set can be replaced by few (even solely K=2) features built with our FCS without compromising performance in many cases. In some cases, performance even improved. GP-GOMEART and SGPb achieve this result while keeping the constructed feature size extremely limited (h=2,4). SGP leads to slightly better performance than GP-GOMEART and SGPb, but at the cost of constructing five to ten times larger features. RS proved to be less effective than the GP algorithms.

Our FCS is arguably most sensible to use for simpler ML algorithms, such as NB and LR. Constructed features change the space upon which the ML algorithm operates. SVM already includes the kernel trick to change the feature space. Similarly, the trees of RF effectively embody complex non-linear feature combinations to explain the variance in the data. NB and LR, instead, do not include such mechanisms. Rather, they have particular assumptions on how the features should be combined (NB assumes normality, LR linearity). The features constructed by GP can transform the input the ML algorithm operates upon, to better fit its assumptions.

We found that performance was almost always significantly better than compared to using the original feature set for NB and LR. As running times for these ML algorithms can be in the order of seconds or minutes (Sec. 8), feature construction has the potential to be routinely used in data analysis and machine learning practice. Furthermore, FCS (or a modification where the constructed features are added to the original set) can be used as an alternative way to tune simple ML algorithms which have limited or no hyper-parameters.

Regarding the comparison between the search algorithms, GP-GOMEART does typically result in better training performance than SGPb (especially for h=2,K=2), and this is also found at test time for regression, but not for classification. We believe that significantly better results can be achieved if bigger population sizes and larger evaluations budgets can be employed (we kept the population size limited due to the computational expensiveness of SVM and RF).

Particularly for GP-GOMEA, previous work has shown that having sufficiently large population sizes enables the possibility to exploit linkage estimation and perform better-than-random mixing virgolin2019model ; virgolin2017scalable . To validate this also within the framework of our proposed FCS, we scaled the population size and the budget of fitness evaluations, and compared the use of the LT with the use of the RT, on the largest classification dataset, Madelon, using NB. The outcome is shown in Figure 9: the employment of big-enough population sizes (and of sufficient numbers of fitness evaluation) indeed leads to better performance. All in all, we recommend the use of GP-GOMEA as feature constructor since it was not worse on classification and was statistically better for regression. Furthermore, we advice to use the LT if the population size can be of the order of thousands or more (or even better, if exponential population sizing schemes are used as in virgolin2019model ; virgolin2017scalable ). Otherwise, the RT should be preferred.

Figure 9: Comparison between the use of the RT and of the LT in GP-GOMEA. Vertical axis: median F1 score of 30 runs, obtained by NB on Madelon using five constructed features with h=4. Horizontal axis: population size / fitness evaluations budget. Stars indicate significant superiority of one method w.r.t. the other.

To assess if small constructed features are interpretable and if it is possible to visualize what the behavior of learned ML models, we showed some examples, providing evidence that both requirements can be reasonably satisfied. We did not perform a thorough study on interpretability of constructed features. A proper evaluation of interpretability should likely involve targeted user studies. Experts of a field could give feedback on features constructed for datasets they are knowledgeable about. Nonetheless, enforcing features (and GP programs in general) to be small is a necessary condition to allow interpretability, and is typically ignored in literature virgolin2019model .

Considering the visualization examples proposed in Section 7, it is natural to compare our approach with well-known dimensionality reduction techniques, such as Principal Component Analysis (PCA) wold1987principal or t-Distributed Stochastic Neighbor Embedding (t-SNE) maaten2008visualizing . We remark that those techniques and our FCS have very different objectives. In general, the sole aim of such techniques is to reduce the data dimensionality. PCA does so by detecting components that capture maximal variance. However, it does not attempt to optimize the transformation of the original feature set to improve an ML algorithm’s performance. Also, PCA does not focus on the interpretability of the feature transformations. FCS takes the performance of the ML algorithm and interpretability of the features into account, while dimensionality reduction comes from forcing the construction of few features. We compared using 2 features constructed with RS (the worst search algorithm) with maximum h=2, with using the first 2 PCs found by PCA. The use of constructed features over PCs resulted in significantly superior test performance for all ML algorithms and for all problems. We remark, however, that PCA is extremely fast and independent from the ML algorithm.

Our FCS has several limitations. A first limitation regards the performance obtainable by the ML algorithm using the constructed features. FCS is iterative, and this can lead to suboptimal performance for a chosen K, compared to attempting to find K features at once. This is because the contributions of multiple features to an ML algorithm are not necessarily perpendicular to each other tran2019genetic . FCS could be changed to find at any given iteration, a synergistic set of K features, that is independent from previous iterations. To this end, larger population sizes need to be employed, and the search algorithms need to be modified so that they can evolve sets of constructed features (a similar proposal for SGP was done in krawiec2002genetic ). Yet, it is reasonable to expect that if K features need to be learned at the same time, larger population sizes may be needed compared to learning the K features iteratively.

Another limitation of this work is that hyper-parameter tuning was not considered. To include hyper-parameter tuning within FCS could bring even higher performance scores, or help prevent overfitting. A possibility could be, for example, to evolve pairs of features and hyper-parameter settings, where every time a feature is evaluated, the optimal hyper-parameters are also searched for. Such a procedure may likely require strong parallelization efforts, as C-fold cross-validation should be carried out for each combination of hyper-parameter values.

Lastly, it would be interesting to extend our approach to other classification and regression settings, e.g., problems with missing data; or to unsupervised tasks, as simple features may lead to better clustering of the examples.

10 Conclusion

With a simple evolutionary feature construction framework we have studied the feasibility of constructing few crucial and compact features with Genetic Programming (GP), towards improving the explainability of Machine Learning (ML) models without losing prediction accuracy. Within the proposed framework, we compared standard GP, random search, and the GP adaptation of the Gene-pool Optimal Mixing Evolutionary Algorithm (GP-GOMEA) as feature constructors, and found that GP-GOMEA is overall preferable when strict limitations on feature size are enforced. Despite limitations on feature size, and despite the reduction of problem dimensionality that we imposed by constructing only two features, we obtained equal or better ML prediction performance compared to using the original feature set for more than half the combinations of datasets and ML algorithms. In many cases, humans can understand what the feature means, and it is possible to visualize how trained ML models will behave. All in all, we conclude that feature construction is most useful and sensible for simpler ML algorithms, where more resources can be used for evolution (e.g., larger population sizes), which, in turn, unlock the added benefits of more advanced evolutionary mechanisms (e.g., using linkage learning in GP-GOMEA).

Acknowledgments

The authors acknowledge the Kinderen Kankervrij foundation for financial support (project #187). The majority of the computations for this work were performed on the Lisa Compute Cluster with the support of SURFsara.

References

  • (1) H. Liu, H. Motoda, Feature extraction, construction and selection: A data mining perspective, Vol. 453, Springer Science & Business Media, 1998.
  • (2) J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Vol. 1, Springer series in statistics New York, NY, USA:, 2001.
  • (3) J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, USA, 1992.
  • (4) R. Poli, W. B. Langdon, N. F. McPhee, J. R. Koza, A field guide to genetic programming, Lulu. com, 2008.
  • (5) J. H. Friedman, Multivariate adaptive regression splines, The annals of statistics (1991) 1–67.
  • (6) D. W. Hosmer Jr, S. Lemeshow, R. X. Sturdivant, Applied logistic regression, Vol. 398, John Wiley & Sons, 2013.
  • (7) Z. C. Lipton, The mythos of model interpretability, Queue 16 (3) (2018) 30:31–30:57.
  • (8) R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM computing surveys (CSUR) 51 (5) (2018) 93.
  • (9) A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (xai), IEEE Access 6 (2018) 52138–52160.
  • (10) B. Goodman, S. Flaxman, European union regulations on algorithmic decision-making and a “right to explanation”, AI Magazine 38 (3) (2017) 50–57.
  • (11) M. Virgolin, T. Alderliesten, C. Witteveen, P. A. N. Bosman, Improving model-based genetic programming for symbolic regression of small expressions, CoRR abs/1904.02050. arXiv:1904.02050.
  • (12) B. Xue, M. Zhang, W. N. Browne, X. Yao, A survey on evolutionary computation approaches to feature selection, IEEE Transactions on Evolutionary Computation 20 (4) (2016) 606–626.
  • (13) K. Krawiec, Genetic programming-based construction of features for machine learning and knowledge discovery tasks, Genetic Programming and Evolvable Machines 3 (4) (2002) 329–343.
  • (14) L. Breiman, Classification and regression trees, Routledge, 2017.
  • (15) M. Muharram, G. D. Smith, Evolutionary constructive induction, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1518–1528.
  • (16) B. Tran, B. Xue, M. Zhang, Genetic programming for feature construction and selection in classification on high-dimensional data, Memetic Computing 8 (1) (2016) 3–15.
  • (17) N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician 46 (3) (1992) 175–185.
  • (18) S. J. Russell, P. Norvig, Artificial intelligence: a modern approach, Malaysia; Pearson Education Limited,, 2016.
  • (19) K. P. Murphy, Naive Bayes classifiers, University of British Columbia 18.
  • (20) Q. Chen, M. Zhang, B. Xue, Genetic programming with embedded feature construction for high-dimensional symbolic regression, in: Intelligent and Evolutionary Systems, Springer, 2017, pp. 87–102.
  • (21) A. Cano, S. Ventura, K. J. Cios, Multi-objective genetic programming for feature extraction and data visualization, Soft Computing 21 (8) (2017) 2069–2089.
  • (22) M. Virgolin, T. Alderliesten, A. Bel, C. Witteveen, P. A. N. Bosman, Symbolic regression and feature construction with GP-GOMEA applied to radiotherapy dose reconstruction of childhood cancer survivors, in: Genetic and Evolutionary Computation (GECCO 2018), ACM, 2018, pp. 1395–1402.
  • (23) B. Tran, B. Xue, M. Zhang, Genetic programming for multiple-feature construction on high-dimensional classification, Pattern Recognition 93 (2019) 404–417.
  • (24) C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (3) (1995) 273–297.
  • (25) L. Breiman, Random forests, Machine learning 45 (1) (2001) 5–32.
  • (26) R. Kohavi, G. H. John, Wrappers for feature subset selection, Artificial intelligence 97 (1-2) (1997) 273–324.
  • (27) M. Virgolin, T. Alderliesten, C. Witteveen, P. A. N. Bosman, Scalable genetic programming by gene-pool optimal mixing and input-space entropy-based building-block learning, in: Genetic and Evolutionary Computation (GECCO 2017), ACM, New York, NY, USA, 2017, pp. 1041–1048.
  • (28) T. P. Pawlak, B. Wieloch, K. Krawiec, Semantic backpropagation for designing search operators in genetic programming, IEEE Transactions on Evolutionary Computation 19 (3) (2015) 326–340.
  • (29) R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, A. G. Gray, MLPACK: A scalable C++ machine learning library, Journal of Machine Learning Research 14 (2013) 801–805.
  • (30) C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • (31) M. N. Wright, A. Ziegler, ranger: A fast implementation of random forests for high dimensional data in C++ and R, arXiv preprint arXiv:1508.04409.
  • (32) J. Ni, R. H. Drieberg, P. I. Rockett, The use of an analytic quotient operator in genetic programming, IEEE Transactions on Evolutionary Computation 17 (1) (2013) 146–152.
  • (33) D. R. White, J. Mcdermott, M. Castelli, L. Manzoni, B. W. Goldman, G. Kronberger, W. Jaśkowski, U.-M. O’Reilly, S. Luke, Better GP benchmarks: community survey results and proposals, Genetic Programming and Evolvable Machines 14 (1) (2013) 3–29.
  • (34) J. Albinati, G. L. Pappa, F. E. Otero, L. O. V. Oliveira, The effect of distinct geometric semantic crossover operators in regression problems, in: European Conference on Genetic Programming, Springer, 2015, pp. 3–15.
  • (35) J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
  • (36) S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian journal of statistics (1979) 65–70.
  • (37) A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Tech. rep., Citeseer (2009).
  • (38) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
  • (39) S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and intelligent laboratory systems 2 (1-3) (1987) 37–52.
  • (40) L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9 (2008) 2579–2605.