Abstract
Feature construction can substantially improve the accuracy of MachineLearning (ML) algorithms. Genetic Programming (GP) has been proven to beeffective at this task by evolving nonlinear combinations of input features.GP additionally has the potential to improve ML explainability since explicitexpressions are evolved. Yet, in most GP works the complexity of evolvedfeatures is not explicitly bound or minimized though this is arguably key forexplainability. In this article, we assess to what extent GP still performsfavorably at feature construction when constructing features that are (1) Ofsmallenough number, to enable visualization of the behavior of the ML model;(2) Of smallenough size, to enable interpretability of the featuresthemselves; (3) Of sufficient informative power, to retain or even improve theperformance of the ML algorithm. We consider a simple feature constructionscheme using three different GP algorithms, as well as random search, to evolvefeatures for four ML algorithms, including support vector machines and randomforest. Our results on 20 datasets pertaining to classification and regressionproblems show that constructing only two compact features can be sufficient torival the use of the entire original feature set. We further find that a modernGP algorithm, GPGOMEA, performs best overall. These results, combined withexamples that we provide of readable constructed features and of 2Dvisualizations of ML behavior, lead us to positively conclude that GPbasedfeature construction still works well when explicitly searching for compactfeatures, making it extremely helpful to explain ML models.
Quick Read (beta)
On Explaining Machine Learning Models by Evolving Crucial and Compact Features
Abstract
Feature construction can substantially improve the accuracy of Machine Learning (ML) algorithms. Genetic Programming (GP) has been proven to be effective at this task by evolving nonlinear combinations of input features. GP additionally has the potential to improve ML explainability since explicit expressions are evolved. Yet, in most GP works the complexity of evolved features is not explicitly bound or minimized though this is arguably key for explainability. In this article, we assess to what extent GP still performs favorably at feature construction when constructing features that are (1) Of smallenough number, to enable visualization of the behavior of the ML model; (2) Of smallenough size, to enable interpretability of the features themselves; (3) Of sufficient informative power, to retain or even improve the performance of the ML algorithm. We consider a simple feature construction scheme using three different GP algorithms, as well as random search, to evolve features for four ML algorithms, including support vector machines and random forest. Our results on 20 datasets pertaining to classification and regression problems show that constructing only two compact features can be sufficient to rival the use of the entire original feature set. We further find that a modern GP algorithm, GPGOMEA, performs best overall. These results, combined with examples that we provide of readable constructed features and of 2D visualizations of ML behavior, lead us to positively conclude that GPbased feature construction still works well when explicitly searching for compact features, making it extremely helpful to explain ML models.
keywords:
feature construction, interpretable machine learning, genetic programming, GOMEAshapes.geometric, arrows \tikzstylesquarered = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=red!10] \tikzstylesquaregreen = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=green!10] \tikzstylesquareblue = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=blue!10] \tikzstylediamondgreen = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=green!10] \tikzstylediamondblue = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=blue!10] \tikzstylediamondred = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=red!10] \tikzstylediamondorange = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=orange!10] \tikzstylecirclewhite = [circle, minimum width=0.5cm, minimum height=0.5cm, text centered, draw=black] \tikzstylecircleyellow = [circle, minimum width=1cm, minimum height=1cm, text centered, draw=black, fill=yellow!10] \tikzstylesquareryellow = [rectangle, rounded corners, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=yellow!10] \tikzstyleinvisible = [rectangle, minimum width=0.5cm, minimum height=0.5cm, text centered] \tikzstylearrow = [thick,¿,¿=stealth] \tikzstyledarrow = [dashed,¿,¿=stealth] \tikzstyleline = [thick,,¿=stealth] \algrenewcommand\alglinenumber[1]#1
1 Introduction
Feature selection and feature construction are two important steps to improve the performance of any Machine Learning (ML) algorithm liu1998feature ; friedman2001elements . Feature selection is the task of excluding features that are redundant or misleading. Feature construction is the task of transforming (parts of) the original feature space into one that the ML algorithm can better exploit.
A very interesting method to perform feature construction automatically is Genetic Programming (GP) koza1992gp ; poli2008field . GP can synthesize functions without many prior assumptions on their form, differently from, e.g., logistic regression or regression splines friedman1991multivariate ; hosmer2013applied . Moreover, feature construction not only depends on the data at hand, but also on the way a specific ML algorithm can model that data. Evolutionary methods in general are highly flexible in their use due to the way they perform search (i.e., derivative free). This makes it possible, for example, to evaluate the quality of a feature for a specific ML algorithm by directly measuring what its impact is on the performance of the ML algorithm (i.e., by training and validating the ML algorithm when using that feature).
Explaining what constructed features mean can shed light on the behavior of MLinferred models that use such features. Reducing the number of features is also important to improve interpretability. If the original feature space is reduced to few constructed features (e.g., up to two for regression and up to three for classification), the function learned by the ML model can be straightforwardly visualized w.r.t. the new features. In fact, how to make ML models more understandable is a key topic of modern ML research, as many practical, sensitive applications exist, where explaining (part of) the behavior of ML models is essential to trust their use (e.g., in medical applications) lipton2018mythos ; guidotti2018survey ; adadi2018peeking ; goodman2017european . Typically, GP for feature construction searches in a subspace of mathematical expressions. Adding to the appeal and potential of GP, these expressions can be humaninterpretable if simple enough guidotti2018survey ; virgolin2019model .
Figure 1 presents an example of the potential held by such an approach: a multidimensional dataset transformed into a 2D one, where both the behavior of the ML algorithm and the meaning of the new features is clear, while the performance of the ML algorithm is not compromised w.r.t. the use of the original feature set (it is actually improved).
In this article we study whether GP can be useful to construct a low number of small features, to increase the chance of obtaining interpretable ML models, without compromising their accuracy (compared to using the original feature set). To this end, we design a simple, iterative feature construction scheme, and perform a wide set of experiments: we consider four types of feature construction methods (three GP algorithms and random search), four types of machine learning algorithms. We apply their combinations on 20 datasets between classification and regression to determine to what extent they are capable of effectively and efficiently finding crucial and compact features for specific ML algorithms.
The remainder of this article is organized as follows. Related work is reported in Section 2. The proposed feature construction scheme is presented in Section 3. The search algorithms to construct features, as well as the considered ML algorithms, are presented in Section 4. The experimental setup is described in Section 5. Results related to performance are reported in Section 6, while results concerning interpretability are reported in Section 7. Section 9 discusses our findings, and Section 10 concludes this article.
2 Related Work
Since this article focuses on feature construction, we report here related work that at least includes construction of new features. For contributions that focus exclusively on feature selection, the reader is referred to a recent survey xue2016survey .
One of the first approaches is presented in krawiec2002genetic , where each GP solution is a set of $K$ features. The fitness of a set is the crossvalidation performance of a decision tree breiman2017classification using that set. The results on six classification datasets show that the approach is able to synthesize a feature set that is competitive with the original one, and can also be added to the original set for further improvements. No attention is however given to the interpretability of evolved features.
The work in muharram2005evolutionary generates one feature with Standard, treebased GP (SGP) koza1992gp , to be added to the original set. Feature importance metrics of decision trees such as information gain, Gini index and $Ch{i}^{2}$ are used as fitness measure. An advantage of using such fitness measures over ML performance is that they can be computed very quickly. However, they are decision treespecific. Results show that the approach can improve prediction accuracy, and, for a few problems, it is shown that decision trees that are simple enough to be reasonably interpretable, can be found.
Feature construction for highdimensional datasets is considered in tran2016genetic , for eight biomedical binary classification problems, with 2,000 to 24,188 features. This approach is different from the typical ones, as the authors propose to use SGP to evolve classifiers rather than features, and extract features from the components (subtrees) of such classifiers. These are then used as new features for an ML algorithm. Results on KNearest Neighbors altman1992introduction , Naive Bayes classifier russell2016artificial ; murphy2006naive , and decision tree show that a sofound feature set can be competitive or outperform the original one. The authors show an example where a single interpretable feature is constructed that enables linear separation of the classification examples.
Different from the aforementioned works, chen2017genetic explores feature construction for regression. A SGPbased approach is designed to tackle regression problems with a large number of features, and is tested on six datasets. Instead of using the constructed features for a different ML algorithm, SGP dynamically incorporates them within an ongoing run, to enrich the terminal set. Every $\alpha $ generations of SGP, the subtrees composing the best solutions become new features by encapsulation into new terminal nodes. The approach is found to improve the ability of SGP to find accurate solutions. However, the features found by encapsulating subtrees are not interpretable because allowing subsequent encapsulations leads to an exponential growth of solution size.
A recent work that focuses on evolutionary dimensionality reduction and consequent visualization is cano2017multiobjective , where a multiobjective, grammarbased SGP approach is employed. $K$ feature transformations are evolved in synergy to enable, at the same time, good classification accuracy, and visualization through dimensionality reduction. The system is thoroughly tested on 42 classification tasks, showing that the algorithm performs well compared to stateoftheart dimensionality reduction methods, and it enables visualization of the learned space. However, as trees are free to grow up to a height of 50, the constructed features themselves cannot be interpreted.
The most similar works to ours that we found are virgolin2018gecco and tran2019genetic . In virgolin2018gecco , which is our previous work, the possibility of using a modern modelbased GP algorithm (which we also use in our comparisons) for feature construction is explored on four regression datasets. There, focus is put on keeping feature size small, to actively attempt to obtain readable features. These features are iteratively constructed to be added to the original feature set to improve the performance of the ML algorithm, and three ML algorithms are compared (linear regression, support vector machines cortes1995support , random forest breiman2001random ). Reducing the feature space to enable a better understanding of inferred ML models is not considered.
In tran2019genetic , different feature construction approaches are compared on geneexpression datasets that have a large number of features (thousands to tens of thousands) to study if evolving classdependent features, i.e., features that are each targeted at aiding the ML algorithm detect one specific class, can be beneficial. Similarly to us, the authors show visualizations of feature space reduced to up to three constructed features, and an example of three features that are encoded as very small, easytointerpret trees. However, such small features are a rare outcome as the trees used to encode features typically had more than 75 nodes. These trees are therefore arguably extremely hard to read and interpret.
Our work is different from previous research in two major aspects. First, none of the previous work principally addresses the conflicting objectives of retaining good performance of an ML algorithm while attempting to explain both its behavior (by dimensionality reduction to allow visualization), and the meaning of the features themselves (by constraining feature complexity). Second, multiple GP algorithms within a same feature construction scheme, on multiple ML algorithms, are not compared in previous work. Most of the times, it is a different feature construction scheme that is tested, using arguably small variations of SGP. Here, we consider random search, two versions of SGP, as well as another modern GP algorithm. Furthermore, we adopt both “weak” ML algorithms such as ordinary least squares linear regression and the naive Bayes classifier, as well as “strong”, stateoftheart ones, which are rarely used in literature for feature construction, such as support vector machine and random forest; on both classification and regression tasks.
3 Iterative Evolutionary Feature Construction
We use a remarkably simple scheme to construct features. Our approach constructs $K\in {\mathbb{N}}^{+}$ features by iterating $K$ GP runs. The evolution of the $k$th feature ($k\in \{1,\mathrm{\dots},K\}$) uses the previously constructed $k1$ features.
3.1 Feature Construction Scheme
The dataset $D$ defining the problem at hand is split into two parts: the training $Tr$ and the test $Te$ set. This partition is kept fixed through the whole procedure. Only $Tr$ is used to construct features, while $Te$ is exclusively used for final evaluation to avoid positive bias in the results kohavi1997wrappers . We use the notation ${x}_{j}^{(i)}$ to refer to the $i$th feature value of the $j$th example, and ${y}_{j}$ for the desired outcome (label for classification or target value for regression) of the $j$th example.
The $k$th GP run evolves the $k$th feature. An example is shown in Figure 2. Each solution in the population competes to become the new feature ${x}^{(k)}$, that represents a transformation of the original feature set. In every run, the population is initialized at random.
We evaluate the fitness of a feature of the $k$th run by measuring the performance of the ML algorithm on a dataset that contains that feature and the previously evolved $k1$ features.
We only use original features (and random constants) as terminals. In particular, the features constructed by previous iterations are not used as terminal nodes in the $k$th run. This prevents the generation of nested features, which could harm interpretability.
At the end of the $k$th run, the best feature is stored and its values ${x}_{j}^{(k)}$ are added to $Tr$ and $Te$ for the next iterations.
3.2 Feature Fitness
The fitness of a feature is computed by measuring the performance (i.e., error) of the ML algorithm when the new feature is added to $Tr$. We consider the $C$fold crossvalidation error rather than the training error to promote generalization and prevent overfitting. The pseudo code of the evaluation function is shown in Algorithm 1.
Specifically, the $C$fold crossvalidation error is computed by partitioning $Tr$ into $C$ splits. For each $c=1,\mathrm{\dots},C$ iteration, a different split is used for validation (set ${V}^{c}$), and the remaining $C1$ splits are used for training (set $T{r}^{c}$). The mean validation error is the final result.
For classification tasks, in order to take into account both multiple and possibly imbalanced class distributions, the prediction error is computed as 1 minus the macro $F1$ score, i.e., 1 minus the mean of the classspecific $F1$ scores:
$1F1$  $=1{\displaystyle \frac{1}{\mathrm{\#}\text{\mathit{c}\mathit{l}\mathit{a}\mathit{s}\mathit{s}\mathit{e}\mathit{s}}}}{\displaystyle \sum _{\gamma \in \text{\mathit{c}\mathit{l}\mathit{a}\mathit{s}\mathit{s}\mathit{e}\mathit{s}}}}F{1}_{\gamma}$  
$=1{\displaystyle \frac{2}{\mathrm{\#}\text{\mathit{c}\mathit{l}\mathit{a}\mathit{s}\mathit{s}\mathit{e}\mathit{s}}}}{\displaystyle \sum _{\gamma \in \text{\mathit{c}\mathit{l}\mathit{a}\mathit{s}\mathit{s}\mathit{e}\mathit{s}}}}{\displaystyle \frac{\frac{T{P}_{\gamma}}{T{P}_{\gamma}+F{P}_{\gamma}}\frac{T{P}_{\gamma}}{T{P}_{\gamma}+F{N}_{\gamma}}}{\frac{T{P}_{\gamma}}{T{P}_{\gamma}+F{P}_{\gamma}}+\frac{T{P}_{\gamma}}{T{P}_{\gamma}+F{N}_{\gamma}}}},$ 
where $T{P}_{\gamma},F{N}_{\gamma},F{P}_{\gamma}$ are the true positive, false negative, and false positive classifications for the class $\gamma $, respectively. If the computation of $F{1}_{\gamma}$ results in $\frac{0}{0}$, we set $F{1}_{\gamma}=0$.
For regression, the prediction error is computed with the Mean Squared Error (MSE).
[1] \FunctionComputeFeatureFitness$s$ \State$T{r}^{\prime}\leftarrow $AddFeatureToCurrentTrainingSet($s$) \State$error\leftarrow 0$ \For$c=1,\mathrm{\dots},C$ \State${T}^{c},{V}^{c}\leftarrow $SplitSet($c,C,T{r}^{\prime}$) \State$M\leftarrow $TrainMLModel(${T}^{c}$) \State$error\leftarrow error+$ComputeError($M,{V}^{c}$) \EndFor\StateReturn$\left(\frac{error}{C}\right)$ \EndFunction
3.3 Preventing Unnecessary Fitness Computations
Computing the fitness of a feature is particularly expensive, as it consists of a $C$fold crossvalidation of the ML algorithm. This limits the feasibility of, e.g., adopting large population sizes and large numbers of evaluations for the GP algorithms.
We therefore attempt to prevent unnecessary crossvalidation calls, by assessing if features meet four criteria. Let $n$ be the number of examples in $Tr$. The criteria are the following:

1.
The feature is not a constant. We avoid evaluating constant features as they are likely to be useless for many ML algorithms, which internally already compute an intercept.

2.
The feature does not contain extreme values that may cause numerical errors, i.e., with absolute value above a lowerbound ${\beta}_{\mathrm{\ell}}$ or above an upperbound ${\beta}_{u}$. Here, we set ${\beta}_{\mathrm{\ell}}={10}^{10}$, and ${\beta}_{u}={10}^{10}$ (none of the datasets considered here have values exceeding these bounds).

3.
The feature is not equivalent to one constructed in the previous $k\mathrm{}\mathrm{1}$ iterations. Equivalence is determined by checking the values available in $Tr$, i.e., equivalence holds if:
$$\forall j\in Tr,\exists i\in \{1,\mathrm{\dots},k1\}:{x}_{j}^{\left(k\right)}={x}_{j}^{\left(i\right)}.$$ Note that a constructed feature that is equivalent to a feature of the original feature set can be valid, as long as no other previously constructed feature exists that is already equivalent. Thus, our approach can in principle perform pure feature selection.

4.
The values (in $T\mathit{}r$) of the feature in consideration have changed since the last time the feature was evaluated. This is because GP variation is not guaranteed to change the output of solutions (e.g., a multiplication by 1 is inserted). This is realized by caching the latest feature values after evaluation.
The computational effort for each criterion is $O(n)$ (it is $O((k1)n)$ for criterion 3, however in our experiments $k\ll n$). The fitness of a feature failing criterion 1, 2, or 3 is set to the maximum possible error value. If criterion 4 fails, the fitness remains the same (although performing crossvalidation may lead to slightly different results when using stochastic ML algorithms like random forest).
4 Considered Search Algorithms and Machine Learning Algorithms
We consider SGP, Random Search (RS), and the GP instance of the Genepool Optimal Mixing Evolutionary Algorithm (GPGOMEA) as competing search algorithms to construct features. SGP is widely used in feature construction (see related work in Sec. 2). RS is not typically considered, yet we believe it is important to assess whether evolution does bring any benefit over random enumeration within the confines of our study, i.e., when forcing to find small features. GPGOMEA is a recently introduced GP algorithm that has proven to be particularly proficient in evolving accurate solutions of limited size virgolin2019model ; virgolin2017scalable ; virgolin2018gecco .
As ML algorithms, we consider the Naive Bayes Classifier (NBC), ordinary leastsquares Linear Regression (LR), Support Vector Machines (SVM), and Random Forest (RF). NBC is used only for classification tasks, LR only for regression tasks, SVM and RF for both tasks. We provide more details in the following sections.
4.1 Details on the Search Algorithms
All search algorithms use the fitness evaluation function. A feature $s$ is evaluated by first checking whether the four criteria of Section 3.3 are met, and then, if the outcome is positive, by running the ML algorithm over the featureextended dataset.
For SGP, we use subtree crossover and subtree mutation, picking the depth of subtree roots uniformly randomly as proposed in pawlak2015semantic . The candidate parents for variation are chosen with tournament selection. Since we are interested in constructing small features so as to increase the chances they will be interpretable, we consider two versions of SGP. The first is the classic one where solutions are free to grow to tree heights typically much larger than the one used for tree initialization. In the following, the notation SGP refers to this first version. The second one uses trees that are not allowed to grow past the initial maximum tree height. We call this version bounded SGP, and use the notation SGP_{b}.
RS is realized by continuously sampling and evaluating new trees, keeping the best koza1992gp . Like for SGP_{b}, a maximum tree height is fixed during the whole run. If evolution is hypothetically no better than RS, then we expect that SGPb and GPGOMEA will construct features that are no better than the ones constructed by RS.
GPGOMEA is a recently introduced GP algorithm that has been found to deliver accurate solutions of small size on benchmark problems virgolin2017scalable , and to work well when a small size is enforced in symbolic regression virgolin2019model ; virgolin2018gecco . GPGOMEA uses a tree template fixed by a maximum tree height (which can include intron nodes to allow for unbalanced tree shapes) and performs homologous variation, i.e., mixed tree nodes come from the same positions in the tree. Each generation prior to mixing, a hierarchical model that captures interdependencies (linkage) between nodes is built (using mutual information). This model, called Linkage Tree (LT), drives variation by indicating what nodes should be changed en block during mixing, to avoid the disruption of patterns with large linkage.
The LT has been shown to enable GPGOMEA to outperform subtree crossover and subtree mutation of SGP, as well as the use of a randomlybuild LT, i.e., the Random Tree (RT), on problems of different nature virgolin2019model ; virgolin2017scalable . However, the LT requires sufficiently large population sizes to be accurate and beneficial (e.g., several thousand solutions in GP for symbolic regression) virgolin2019model . Because in the framework of this article fitness evaluations use the crossvalidation of a ML algorithm, we cannot afford to use large population sizes. Accordingly, we found the adoption of the LT to not be superior to the adoption of the RT under these circumstances in preliminary experiments. Therefore, for the most part, we adopt GPGOMEA with the RT (GPGOMEA_{RT}). This means we effectively compare random hierarchical homologous variation with subtreebased variation. An example of adopting the LT and large population sizes for feature construction is provided in Section 9.
4.2 Details on the ML algorithms
We now briefly describe the ML algorithms used in this work: NBC, LR, SVM, and RF. NBC and LR are less computationally expensive compared to SVM and RF. In the following, $k$ is the number of features and $n$ is the number of examples in the training set $Tr$.
NBC is a classifier which assumes independence between features russell2016artificial ; murphy2006naive . NBC is often used as a baseline, as it is simple and fast to train. Training an NBC has computational complexity of $O(nk)$, while classifying a new example takes $O(k\times \mathrm{\#}\text{\mathit{c}\mathit{l}\mathit{a}\mathit{s}\mathit{s}\mathit{e}\mathit{s}})$. We use the mlpack implementation of NBC mlpack2013 and assume the data to be normally distributed (default setting).
Similarly to NBC, LR is often used as a baseline as it is simple and fast, for regression tasks. LR assumes that the target variable can be explained by a linear combination of the features russell2016artificial . The training computational complexity of LR is $O(n{k}^{2})$. Predicting the target variable for a new example requires $O(k)$. We use the mlpack implementation of LR mlpack2013 .
SVM is a powerful ML algorithm that can be used for classification and regression cortes1995support ; CC01a . A fundamental part of SVM is the kernel trick, which allows to project the features in a different space. We use the libsvm C++ implementation CC01a , which has a training complexity between $O({n}^{2}k)$ and $O({n}^{3}k)$ depending on the specific training set. Prediction complexity is $O(vk)$, with $v$ the number of support vectors, which is parameter dependent. We consider the Radial Basis Function (RBF) kernel, which works well in practice for many problems, with CSVM for classification, and $\mathcal{E}$SVM for regression.
RF is an ensemble ML algorithm which, like SVM, can be used for both classification and regression breiman2001random . RF works by building an ensemble of decision trees, each trained on a sample of the training set (bagging). At prediction time, the mean (or maximum agreement) prediction of the decision trees is returned. Assuming decision trees can grow up to height $O(\mathrm{log}n)$, the computation time for training is $O(\tau \mu n\mathrm{log}n)$, where $\tau $ is the number of decision trees and $\mu $ is the number of features considered for splitting nodes at decision tree construction, i.e., the mtry parameter (typically $\mu =\sqrt{k}$ for classification and $\mu =k/3$ for regression). The prediction takes $O(\tau \mathrm{log}n)$. We use the ranger C++ implementation wright2015ranger .
SGP(b)  GPGOMEA_{RT}  
Population size  $100$  $100$ 
Initialization method  Ramped Half and Half  Half and Half 
Initialization max tree height  $2\text{\u2013}6$ ($2$ or $4$)  $2$ or $4$ 
Max tree height  17 ($2$ or $4$)  $2$ or $4$ 
Variation  SX $0.9$, SM $0.1$  parameterless 
Selection  \makecellTournament 7, Elitism 1  parameterless 
Function set  $\{+,\times ,,\xf7,{\cdot}^{2},\sqrt{\cdot},{\mathrm{log}}_{p},\mathrm{exp}\}$ for all  
Terminal set  $\{{x}^{\left(i\right)},\text{\U0001d674\U0001d681\U0001d672}\}$ for all 
5 Experiments
We perform 30 runs of our Feature Construction Scheme (FCS), with SGP, SGP_{b}, RS, and GPGOMEA_{RT}, in combination with each ML algorithm (NB only for classification and LR only for regression), on each problem. Each run of the FCS uses a random traintest split of 80%20%, and considers up to $K=5$ features construction rounds. We use a population size of 100 for the search algorithms, and assign a maximum budget of $10,000$ function evaluations to each FCS iteration. This results in relatively large running times for complex ML algorithms (see Sec. 8). An experiment including larger evolutionary budgets and the use of the LT in GPGOMEA is presented in the discussion (Sec. 9). We use a limit on the total number of evaluations instead of a a limit on the total number of generations because GPGOMEA_{RT} performs more evaluations than SGP per generation virgolin2017scalable .
For GPGOMEA_{RT}, SGP_{b}, and RS, we consider two levels of maximum tree height $h$: 2 and 4. This choice yields a maximum solution size of 7 and 31 respectively (using function nodes with a maximum arity $r=2$). We choose these two height levels because we found features with $h=2$ to be arguably easy to read and interpret, whereas features with $h=4$ can already be very hard to understand. This indication is also reported in virgolin2019model for the evolution of symbolic regression formulas. Note that using a tree height limit over a solution size limit prevents finding deep trees containing the nesting of the arguably more complicated to understand nonlinear functions ${\cdot}^{2},\sqrt{\cdot},{\mathrm{log}}_{p},\mathrm{exp}$. We do not consider bigger tree heights as resulting features may likely be impossible to interpret, defying a key focus of this work.
Other parameter settings used for the GP algorithms are shown in Table 1. SGP_{b} uses the same settings as SGP, except for the maximum tree height (at initialization and along the whole run), which is set to the same of GPGOMEA_{RT}. In GPGOMEA_{RT} we use the Half and Half (HH) tree initialization method instead of the Ramped Half and Half (RHH) koza1992gp commonly used for SGP. This proved to be beneficial since GOM varies nodes instead of subtrees virgolin2019model ; virgolin2018gecco . For both HH and RHH, syntactical uniqueness of solutions is enforced for up to 100 tries koza1992gp . In GPGOMEA_{RT} we additionally avoid sampling trees having a terminal node as root by setting the minimum tree height of the grow method to 1. This is not done for SGP and SGP_{b}, because differently from GPGOMEA_{RT} where homologous nodes are varied, subtree root nodes for subtree crossover (SX) and subtree mutation (SM) are chosen uniformly randomly. RS samples new trees using the same initialization method as SGP_{b}, i.e., RHH.
The division operator $\xf7$ used in the function set is the analytic quotient operator ($a\xf7b=a/\sqrt{1+{b}^{2}}$), which was shown to lead to better generalization performance than protected division ni2013use . The logarithm is protected ${\mathrm{log}}_{p}(\cdot )=\mathrm{log}(\cdot )$ and ${\mathrm{log}}_{p}(0)=0$, and so is the square root operator. The terminal set contains the original feature set, and an Ephemeral Random Constant (ERC) poli2008field with values uniformly sampled between the minimum and maximum values of the features in the original training set, i.e., $[\mathrm{min}{x}_{j}^{(i)},\mathrm{max}{x}_{j}^{(i)}],\forall i,j\in Tr$.
The hyperparameter settings for the SVM and RF are shown in Table 2, and are mostly default breiman2001random ; CC01a ; wright2015ranger . NBC and LR implementations do not have hyperparameters.
We consider 10 classification and 10 regression benchmark datasets^{1}^{1} 1 The datasets are available at http://goo.gl/9D2z3b. Details on the datasets are reported in Table 3. Rows with missing values are omitted. Most datasets are taken from the UCI Machine Learning repository^{2}^{2} 2 http://archive.ics.uci.edu/ml/, with exception for Dow Chemical and Tower, which come from GP literature white2013better ; albinati2015effect .
SVM  
Kernel  RBF 
Cost  1 
Epsilon  0.1 
Tolerance  0.001 
Gamma  $\frac{1}{k}$ 
Shrinking  Active 
RF  
Number of trees  100 
Bagging sampling  with replacement 
Classification mtry  $\sqrt{\text{\#features}}$ 
Regression mtry  $\mathrm{min}(1,\frac{\text{\#features}}{3})$ 
Min node size  $1$ classification, $5$ regression 
Split rule  Gini classification, Variance regression 
Dataset  # Features  # Examples  # Classes  
Classification 
Cylinder Bands  39  277  2 
Breast Cancer Wisc.  29  569  2  
Ecoli  7  336  8  
Ionosphere  34  351  2  
Iris  4  150  3  
Madelon  500  2600  2  
Image Segmentation  19  2310  7  
Sonar  60  208  2  
Vowel  9  990  11  
Yeast  8  1484  10  
Regression 
Airfoil  6  1503  – 
Boston Housing  13  506  –  
Concrete  9  1030  –  
Dow Chemical  57  1066  –  
Energy Cooling  9  768  –  
Energy Heating  9  768  –  
Tower  26  4999  –  
Wine Red  12  1599  –  
Wine White  12  4898  –  
Yacht  7  308  –  
6 Results: performance
The results described in this section aim at assessing whether it is possible to construct few and small features that lead to an equal or better performance than the original set, and whether some search algorithms can construct better features than others.
6.1 General Performance of Feature Construction
We begin by observing the datasetwise aggregated performance of FCS for the different GP algorithms and the different ML algorithms, separately for classification and regression.
NB  SVM  RF  
Training 
Test 
Training 
Test 
Training 
Test 

$h=2$ 

$h=4$ 

LR  SVM  RF  
Training 
Test 
Training 
Test 
Training 
Test 

$h=2$ 

$h=4$ 

6.1.1 Classification
Figure 3 shows datasetwise aggregated results obtained for NB, SVM, and RF, for classification tasks. Each data point is the mean among the datasetspecific medians of macro F1 from the 30 runs.
In general, the use of only one constructed feature does not perform as good as the use of the original feature set (with exception for NB). Constructing more features improves the performance, but with diminishing returns.
Specifically for NB, the use of a single constructed feature is already preferable to the use of the original feature set (unless RS is used as constructor). This is likely due to the fact that NB assumes complete independence between the provided features, and this can be implicitly tackled by FCS. SGP (unbounded) is the best performing algorithm as it can evolve arbitrarily complex features, however, the magnitude of improvement of the macro F1 score w.r.t. GPGOMEA_{RT} and SGP_{b} is limited. For $h=4$ and $K=5$, GPGOMEA_{RT} reaches the performance of SGP. GPGOMEA_{RT} is typically slightly better than SGP_{b}, and RS has worse performance.
The performance of FCS for SVM has an almost identical pattern to the one observed for NB, except for the fact that the performance is found to be consistently better. However, for SVM it is preferable to use the original feature set rather than few constructed features. This is evident in terms of training performance, but less at test time. In fact, using only 5 constructed features leads to similar test performance compared to using the original set. The GP algorithms compare to each other similarly to when using NB.
The way performance improves for RF by constructing features is similar to the one observed for NB and SVM. However, for RF the differences between the search algorithms is particularly small: notice that using RS leads to close performance to the ones obtained by using the other GP algorithms, compared to the NB and the SVM case. Moreover, virtually no difference can be seen between GPGOMEA_{RT} and SGP_{b}. This suggests that RF already works well with less refined features. Now, the features constructed by SGP are no longer the best performing at test time. This is likely because SGP evolves larger, more complex features than the other algorithms (see Sec. 6.1.3), making RF overfit.
As to maximum tree height, allowing the constructed features to be bigger ($h=4$ vs $h=2$) moderately improves the performance. Interestingly, GPGOMEA_{RT} with $h=4$ reaches competitive performance with SGP on all ML algorithms, despite the latter having no strict limitation on feature size.
6.1.2 Regression
Results on the regression tasks are shown in Figure 4, datasetwise aggregated for LR, SVM, and RF. We report the results in terms of coefficient of determination, i.e., ${R}^{2}(y,\overline{y})=1MSE(y,\overline{y})/var(y)$. For the three ML algorithms, results overall follow the same pattern. SGP is typically better, although constructing more features reduces the performance gap with the other GP algorithms. GPGOMEA_{RT} is slightly, yet consistently, the best performing within the maximum tree height limitation of 2, while SGP_{b} is visibly preferable only when a single feature is constructed for LR and SVM, for $h=4$. Differently from the classification case, two features are typically enough to reach the performance of the original feature set for all ML algorithms. This is possibly due to the particular function set used in these experiments.
As for classification, allowing for larger trees results in better performance overall, and reduces the gap between SGP and the other GP algorithms.
6.1.3 Feature size
Figure 5 shows the aggregated feature size for the different GP algorithms and RS. The aggregated solution size is computed by taking the median solution size per run, then averaging over datasets, and finally averaging over ML algorithms (classification and regression are considered together). The picture shows how, overall, the known SGP tendency to bloat differs compared to the algorithms working with a strict tree height limitation. SGP features are so large that it is nearly impossible to interpret them (see Sec. 7.1).
RS finds the smallest features for both height limits $h=2$ and $h=4$. Considering that GPGOMEA_{RT} and SGP_{b} generate trees within the same height bounds of RS, we conclude that it is the variation operators that allow finding larger trees with improved fitness within the height limit. GPGOMEA_{RT} seems to construct slightly, yet consistently, larger trees than SGP_{b}.
For SGP, it can be seen that subsequently constructed features are smaller (this is barely visible for GPGOMEA_{RT} and SGP_{b} as well). This is interesting because we do not use any mechanism to promote smaller trees. This result is likely linked to the diminishing returns in performance observed in Figure 3 and 4: constructing new complex and informative features becomes harder with the number of FCS iterations.
Feature size 

6.2 Statistical Significance: Comparing GP Algorithms
The aggregated results of Section 6.1 show moderate differences between GPGOMEA_{RT} and SGP_{b}. These are arguably the most interesting algorithms to compare indepth, as they are able to construct small features that lead to good performance (RS typically constructs less informative features, while SGP constructs very large ones).
We perform statistical significance tests to compare GPGOMEA_{RT} and SGP_{b}. We consider their median performance on the test set $Te$, obtained by the FCS, and also compare it with the use of the original feature set, for each ML algorithm and each dataset. In our case, the subjects of our significance tests are the two search algorithms (i.e., GPGOMEA_{RT} and SGP_{b}) and the original feature set, while the treatments are the configurations given by pairing ML algorithms and datasets demvsar2006statistical .
We first perform a Friedman test, and then pairwise Wilcoxon signed rank tests, paired by treatment (ML algorithmdataset combination) demvsar2006statistical . We adopt the Holm correction method to account for false positive results due to chance holm1979simple .
We consider both $h=2$ and $h=4$, and focus on $K=2$, since consideration of only two constructed features makes interpretation easier, and allows human visualization (see Sec. 7.1).
6.2.1 Classification
For both $h=2,4$, the Friedman tests strongly indicates differences between GPGOMEA_{RT}, SGP_{b}, and the use of the original feature set ($p\text{value}\ll 0.05$).
Figure 6 (top) shows the Holmcorrected $p$values obtained by the pairwise Wilcoxon tests for classification, where the alternative hypothesis is that the row allows for larger macro F1 scores than the column. No significant differences between GPGOMEA_{RT} and SGP_{b} are found for both $h=2,4$, although both lead to generally better performance than use of the original feature set. The latter result appears to be in contrast with the results from Fig. 3 for SVM and RF, where it can be seen that the construction of only two features does, on average, lead to slightly worse results than using the original feature set. Nonetheless, the opposite is true for NB, and with larger magnitude. Considering these three ML algorithms together on all 10 datasets, there are more cases where using two evolved features is similarly good or better than using the original feature set, ultimately leading to the positive outcome of the statistical significance test.
The $p$value for GPGOMEA_{RT} is barely larger than 0.05 for $h=2$ (due to Holm’s correction), while the one of SGP_{b} is slightly smaller. SGP_{b} seems preferable in this case. However, we found that GPGOMEA_{RT}’s training performance is significantly better than the one of SGP_{b} (corrected$p\text{value}\ll 0.01$). This means that GPGOMEA_{RT} actually constructed better features w.r.t. the training set, which resulted in a slight overfitting at test time.
For $h=4$, again no significant difference is found between GPGOMEA_{RT} and of SGP_{b} for FCS. Both search algorithms enable significantly better performance than the use of the original feature set, with $p$values very close to 0.05.
6.2.2 Regression
As for classification datasets, the Friedman tests indicates that differences are presents between the subjects. Figure 6 (bottom) shows the Holmcorrected $p$values obtained by the pairwise Wilcoxon tests for regression.
The statistical tests confirm the hypothesis that the algorithms are capable of providing constructed features that are more informative than the original feature set, as observed in Fig. 4 for the regression datasets. Now, GPGOMEA_{RT} is significantly better than SGP_{b} when $h=2$. As found for classification, GPGOMEA_{RT} also here leads to better performance than SGP_{b} on the training set (corrected$p\text{value}\ll 0.01$), however, this time this is reflected at the test time, meaning than no relevant overfitting is happening. For $h=4$, GPGOMEA_{RT} is not found to be significantly better than SGP_{b}.
Classification 

Regression 

$h=2$ 
$h=4$ 
6.3 Statistical Significance: Two Constructed Features vs. the Original Feature Set per ML Algorithm
Results presented in Sec. 6.1 indicate that our FCS brings most benefit if used with the weak ML algorithms. We now report, for each ML algorithm, on how many datasets 2 features constructed using GPGOMEA (with $h=2$ and $h=4$) lead to statistically significantly (using Holmcorrected pairwise Wilcoxon test) better, equal, or worse results compared to using the original feature set on the test set. This is shown in Table 4.
These results confirm what seen in Figures 3 and 4. Using FCS typically outperforms the use of the original feature set for the weak ML algorithms. For the strong ML algorithms, in most cases, using the original feature set is preferable. However, for some datasets reducing the space to two compact features without compromising performance is still possible.
The use of the original feature set is hardest to beat for RF. In the regression case with $h=4$, FCS brings benefits on the datasets Airfoil, Energy Cooling, Energy Heating, and Yacht; and performs on par with the use of the original feature set on the datasets Boston Housing and Concrete. These datasets are the ones with the smallest number of original features. We find similar results for SVM. It is reasonable to expect that FCS works well when few features can be combined.
In the classification case, findings are different. For RF and $h=4$, the datasets where using two constructed features bring similar or better results than using the original feature set are Breast Cancer Wisconsin and Iris. The latter does have a small number of original features (4), but the former has more than several other datasets (29). Furthermore, the datasets where FCS helps are different for SVM: FCS performs equally good to the original feature set on Iris and Cylinder Bands (39 features), and better on Madelon (500 features) and Image Segmentation (19 features). For classification datasets, we cannot conclude that a small cardinality of the original feature set is a good indication feature construction will work well. Furthermore, feature construction influences different ML algorithms in different ways.
$h$  NB  SVM  RF  
Class. 
$2$  8/1/1  2/2/6  1/1/8 
$4$  8/1/1  2/2/6  1/1/8  
$h$  LR  SVM  RF  
Regr. 
$2$  5/3/2  5/2/3  4/0/6 
$4$  7/1/2  5/2/3  4/2/4 
7 Results: improving interpretability
The results presented in Sec. 6 showed that the original feature set can be already outperformed by two small constructed features in many cases. We now aim at assessing whether constraining features size can enable interpretability of the features themselves, as well as if extra insight can be achieved by plotting and visualizing the behavior of a trained ML model in the new twodimensional space.
7.1 Interpretability of Small Features
Table 5 shows some examples of features constructed by GPGOMEA_{RT}, for $h=2$ and $h=4$. We report the first feature constructed for the $K=2$ case, with median test performance. We show the first feature as it is typically not smaller than the second (see Fig. 5). Analytic quotients and protected logarithms are replaced by their respective definitions. We remark that we do not check whether the meaning of the features is sound (e.g., ensuring a certain unit of measure is returned). Constraining feature meaning is problemdependent, and outside the scope of this work.
For classification, we choose NB as it is the method which benefits most from feature construction. The dataset considered is Cylinder Bands, where NB achieves the largest median test improvement: from $F1=0.27$ with the original set, to $F1=0.60$ with $K=2$, for both $h=2$ and $h=4$.
For regression, we consider LR on the Concrete dataset, for the same aforementioned reasons. The test ${R}^{2}$ obtained with the original feature set is $0.59$, the one with two features constructed by GPGOMEA_{RT} is $0.76$ ($0.78$) for $h=2$ ($h=4$).
For $h=2$, we argue that constructed features are mostly easy to interpret. For example, the feature shown for LR on Concrete tells us that aging (${x}^{(8)}$) has a negative impact on concrete compressive strength, whereas using more water (${x}^{(4)}$) than cement (${x}^{(1)}$) has a positive effect (both features are in ${\text{kg/cm}}^{3}$). The impact of other features is less important (within the data variability of the dataset). For $h=4$, some features can be harder to read and understand, however many are still accessible. This is mostly because, even though the total solution size reachable with $h=4$ is 31, constructed features are typically half the size (see Fig. 5).
Overall, we cannot draw a strict conclusion on whether the features found by our approach are interpretable, as interpretability is a subjective matter and no clearcut metric exists lipton2018mythos ; guidotti2018survey . Yet, it appears evident that enforcing a restriction on their size is a necessary condition. We generally find that features using 15 or more nodes start to be hard to interpret w.r.t. our experimental settings, i.e., using our function set.Lastly, features constructed without a strict size limitation (by SGP) are in general very large, and thus extremely hard to understand. As an example, Figure 7 shows the first of the two features constructed by SGP for NB on Cylinder Bands (this is smaller than the first feature found by SGP for LR on Concrete).
$(((({\mathrm{log}}_{p}(({({x}^{(31)})}^{2}{\mathrm{log}}_{p}(({x}^{(24)}\xf7(({\mathrm{log}}_{p}({\mathrm{log}}_{p}({x}^{(3)}+{x}^{(31)}))\times ({({x}^{(3)}\times {x}^{(18)})}^{2}\times $  
$((({x}^{(24)}\times {x}^{(2)})+{x}^{(31)})\times {x}^{(3)}+{\mathrm{log}}_{p}({x}^{(3)}+{x}^{(34)})\xf7(({x}^{(3)}\times {x}^{(18)}+{x}^{(14)}+$  
${x}^{(34)})\xf7{x}^{(14)}))))+({x}^{(24)}\xf7((((({x}^{(6)}\xf7{({x}^{(5)})}^{2})\xf7{x}^{(3)})+{x}^{(3)})\times {x}^{(8)})+$  
${{x}^{(31)})))))))\times {x}^{(5)}+({x}^{(3)}\times {x}^{(18)}+{x}^{(31)})\xf7{x}^{(31)}))}^{2}\times {x}^{(9)})){}^{\frac{1}{2}}$ 
$h$  1st Feature  
NB 
2  $\left({x}^{\left(27\right)}+{x}^{\left(29\right)}\right)/\sqrt{1+{\left({x}^{\left(9\right)}{x}^{\left(13\right)}\right)}^{2}}$ 
4  ${x}^{\left(5\right)}{x}^{\left(10\right)}\sqrt{{x}^{\left(13\right)}}\mathrm{log}\left{x}^{\left(9\right)}\right$  
LR 
2  ${x}^{\left(4\right)}{x}^{\left(1\right)}+932.204/\sqrt{1+{\left({x}^{\left(8\right)}\right)}^{2}}$ 
4  $\sqrt{19.764\mathrm{log}\left{x}^{\left(8\right)}\right}+{x}^{\left(2\right)}+2{x}^{\left(1\right)}/\sqrt{1+{\left({x}^{\left(4\right)}\right)}^{2}}$ 
7.2 Visualizing What the ML Algorithm Learns
The construction of a small number of interpretable features can enable a better understanding of the problem and of the learned ML models. The case where up to two features are constructed is particularly interesting, since it allows visualization.
We provide one example of classification boundaries and one of a regressed surface, inferred by SVM on a two dimensional feature space obtained with our approach using GPGOMEA_{RT}.
The classification dataset on which we find the best test improvement for $h=4$ is Image Segmentation, where the $F1$ score of SVM reaches 0.88, against 0.65 using the original feature set (median run). Figure 8 shows the classification boundaries learned by SVM. The analytic quotient operator $\xf7$ and the protected log ${\mathrm{log}}_{p}$ are replaced by their definition for readability. The constructed features are rather complex here, yet readable. At the same time, it can be clearly seen how the training and test examples are distributed in the 2D space, and what classification boundaries SVM learned.
For regression, Figure 1 shows the surface learned by SVM on Yacht (median run), where GPGOMEA_{RT} with $h=2$ constructs two features that lead to an ${R}^{2}$ of 0.98, against 0.85 obtained using the original feature set. The features are arguably easy to interpret, while it can be seen that the learned surface accurately models most of the data points.
Dataset  Size  NB/LR  SVM  RF  
Clas. 
Iris  $150\times 4$  $7$ s  $2$ m  $25$ m 
Madelon  $2600\times 500$  $4$ m  $14$ h  $8$ h  
Regr. 
Yacht  $308\times 7$  $8$ s  $4$ m  $1$ h 
Tower  $4999\times 26$  $2$ m  $34$ h  $34$ h 
8 Running Time
Our results are made possible by evaluating the fitness of constructed features with crossvalidation, a procedure which is particularly expensive. Table 6 shows the (mean over 30 runs) serial running time to construct five features on the smallest and largest classification and regression datasets, using GPGOMEA_{RT} with $h=4$ and the parameter settings of Sec. 5, on the relatively old AMD Opteron™ Processor 6386 SE^{3}^{3} 3 http://cpuboss.com/cpu/AMDOpteron6386SE. Running time has a large variability, from seconds to dozens of hours, depending on dataset size and ML algorithm. For the datasets and ML algorithms we considered, it can be argued that our approach can be used in practice. However, for datasets with many more features and examples (e.g., image recognition datasets such as CIFAR krizhevsky2009learning ), and much more timeintensive ML algorithms (e.g., GP itself, or deep neural networks lecun2015deep ), to use FCS may not be feasible.
As to memory occupation, it basically mostly depends on the way the chosen ML algorithm handles the dataset. Our runs required at most few hundreds of MBs when dealing with the larger datasets, for SVM and RF.
9 Discussion
We believe this is one of the few works on evolutionary feature construction where the focus is put on both improving the performance of an ML algorithm, and on human interpretability at the same time. The interpretability we aimed for is twofold: understanding the meaning of the features themselves, as well as reducing their number. GP algorithms are key, as they can provide features as interpretable expressions given basic functional components, and a complexity limit (e.g., tree height).
We have run a large set of experiments, totaling more than 100,000 cpuhours. Our results strongly support the hypothesis that the original feature set can be replaced by few (even solely $K=2$) features built with our FCS without compromising performance in many cases. In some cases, performance even improved. GPGOMEA_{RT} and SGP_{b} achieve this result while keeping the constructed feature size extremely limited ($h=2,4$). SGP leads to slightly better performance than GPGOMEA_{RT} and SGP_{b}, but at the cost of constructing five to ten times larger features. RS proved to be less effective than the GP algorithms.
Our FCS is arguably most sensible to use for simpler ML algorithms, such as NB and LR. Constructed features change the space upon which the ML algorithm operates. SVM already includes the kernel trick to change the feature space. Similarly, the trees of RF effectively embody complex nonlinear feature combinations to explain the variance in the data. NB and LR, instead, do not include such mechanisms. Rather, they have particular assumptions on how the features should be combined (NB assumes normality, LR linearity). The features constructed by GP can transform the input the ML algorithm operates upon, to better fit its assumptions.
We found that performance was almost always significantly better than compared to using the original feature set for NB and LR. As running times for these ML algorithms can be in the order of seconds or minutes (Sec. 8), feature construction has the potential to be routinely used in data analysis and machine learning practice. Furthermore, FCS (or a modification where the constructed features are added to the original set) can be used as an alternative way to tune simple ML algorithms which have limited or no hyperparameters.
Regarding the comparison between the search algorithms, GPGOMEA_{RT} does typically result in better training performance than SGP_{b} (especially for $h=2,K=2$), and this is also found at test time for regression, but not for classification. We believe that significantly better results can be achieved if bigger population sizes and larger evaluations budgets can be employed (we kept the population size limited due to the computational expensiveness of SVM and RF).
Particularly for GPGOMEA, previous work has shown that having sufficiently large population sizes enables the possibility to exploit linkage estimation and perform betterthanrandom mixing virgolin2019model ; virgolin2017scalable . To validate this also within the framework of our proposed FCS, we scaled the population size and the budget of fitness evaluations, and compared the use of the LT with the use of the RT, on the largest classification dataset, Madelon, using NB. The outcome is shown in Figure 9: the employment of bigenough population sizes (and of sufficient numbers of fitness evaluation) indeed leads to better performance. All in all, we recommend the use of GPGOMEA as feature constructor since it was not worse on classification and was statistically better for regression. Furthermore, we advice to use the LT if the population size can be of the order of thousands or more (or even better, if exponential population sizing schemes are used as in virgolin2019model ; virgolin2017scalable ). Otherwise, the RT should be preferred.
To assess if small constructed features are interpretable and if it is possible to visualize what the behavior of learned ML models, we showed some examples, providing evidence that both requirements can be reasonably satisfied. We did not perform a thorough study on interpretability of constructed features. A proper evaluation of interpretability should likely involve targeted user studies. Experts of a field could give feedback on features constructed for datasets they are knowledgeable about. Nonetheless, enforcing features (and GP programs in general) to be small is a necessary condition to allow interpretability, and is typically ignored in literature virgolin2019model .
Considering the visualization examples proposed in Section 7, it is natural to compare our approach with wellknown dimensionality reduction techniques, such as Principal Component Analysis (PCA) wold1987principal or tDistributed Stochastic Neighbor Embedding (tSNE) maaten2008visualizing . We remark that those techniques and our FCS have very different objectives. In general, the sole aim of such techniques is to reduce the data dimensionality. PCA does so by detecting components that capture maximal variance. However, it does not attempt to optimize the transformation of the original feature set to improve an ML algorithm’s performance. Also, PCA does not focus on the interpretability of the feature transformations. FCS takes the performance of the ML algorithm and interpretability of the features into account, while dimensionality reduction comes from forcing the construction of few features. We compared using 2 features constructed with RS (the worst search algorithm) with maximum $h=2$, with using the first 2 PCs found by PCA. The use of constructed features over PCs resulted in significantly superior test performance for all ML algorithms and for all problems. We remark, however, that PCA is extremely fast and independent from the ML algorithm.
Our FCS has several limitations. A first limitation regards the performance obtainable by the ML algorithm using the constructed features. FCS is iterative, and this can lead to suboptimal performance for a chosen $K$, compared to attempting to find $K$ features at once. This is because the contributions of multiple features to an ML algorithm are not necessarily perpendicular to each other tran2019genetic . FCS could be changed to find at any given iteration, a synergistic set of $K$ features, that is independent from previous iterations. To this end, larger population sizes need to be employed, and the search algorithms need to be modified so that they can evolve sets of constructed features (a similar proposal for SGP was done in krawiec2002genetic ). Yet, it is reasonable to expect that if $K$ features need to be learned at the same time, larger population sizes may be needed compared to learning the $K$ features iteratively.
Another limitation of this work is that hyperparameter tuning was not considered. To include hyperparameter tuning within FCS could bring even higher performance scores, or help prevent overfitting. A possibility could be, for example, to evolve pairs of features and hyperparameter settings, where every time a feature is evaluated, the optimal hyperparameters are also searched for. Such a procedure may likely require strong parallelization efforts, as $C$fold crossvalidation should be carried out for each combination of hyperparameter values.
Lastly, it would be interesting to extend our approach to other classification and regression settings, e.g., problems with missing data; or to unsupervised tasks, as simple features may lead to better clustering of the examples.
10 Conclusion
With a simple evolutionary feature construction framework we have studied the feasibility of constructing few crucial and compact features with Genetic Programming (GP), towards improving the explainability of Machine Learning (ML) models without losing prediction accuracy. Within the proposed framework, we compared standard GP, random search, and the GP adaptation of the Genepool Optimal Mixing Evolutionary Algorithm (GPGOMEA) as feature constructors, and found that GPGOMEA is overall preferable when strict limitations on feature size are enforced. Despite limitations on feature size, and despite the reduction of problem dimensionality that we imposed by constructing only two features, we obtained equal or better ML prediction performance compared to using the original feature set for more than half the combinations of datasets and ML algorithms. In many cases, humans can understand what the feature means, and it is possible to visualize how trained ML models will behave. All in all, we conclude that feature construction is most useful and sensible for simpler ML algorithms, where more resources can be used for evolution (e.g., larger population sizes), which, in turn, unlock the added benefits of more advanced evolutionary mechanisms (e.g., using linkage learning in GPGOMEA).
Acknowledgments
The authors acknowledge the Kinderen Kankervrij foundation for financial support (project #187). The majority of the computations for this work were performed on the Lisa Compute Cluster with the support of SURFsara.
References
 (1) H. Liu, H. Motoda, Feature extraction, construction and selection: A data mining perspective, Vol. 453, Springer Science & Business Media, 1998.
 (2) J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Vol. 1, Springer series in statistics New York, NY, USA:, 2001.
 (3) J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, USA, 1992.
 (4) R. Poli, W. B. Langdon, N. F. McPhee, J. R. Koza, A field guide to genetic programming, Lulu. com, 2008.
 (5) J. H. Friedman, Multivariate adaptive regression splines, The annals of statistics (1991) 1–67.
 (6) D. W. Hosmer Jr, S. Lemeshow, R. X. Sturdivant, Applied logistic regression, Vol. 398, John Wiley & Sons, 2013.
 (7) Z. C. Lipton, The mythos of model interpretability, Queue 16 (3) (2018) 30:31–30:57.
 (8) R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM computing surveys (CSUR) 51 (5) (2018) 93.
 (9) A. Adadi, M. Berrada, Peeking inside the blackbox: A survey on explainable artificial intelligence (xai), IEEE Access 6 (2018) 52138–52160.
 (10) B. Goodman, S. Flaxman, European union regulations on algorithmic decisionmaking and a “right to explanation”, AI Magazine 38 (3) (2017) 50–57.
 (11) M. Virgolin, T. Alderliesten, C. Witteveen, P. A. N. Bosman, Improving modelbased genetic programming for symbolic regression of small expressions, CoRR abs/1904.02050. arXiv:1904.02050.
 (12) B. Xue, M. Zhang, W. N. Browne, X. Yao, A survey on evolutionary computation approaches to feature selection, IEEE Transactions on Evolutionary Computation 20 (4) (2016) 606–626.
 (13) K. Krawiec, Genetic programmingbased construction of features for machine learning and knowledge discovery tasks, Genetic Programming and Evolvable Machines 3 (4) (2002) 329–343.
 (14) L. Breiman, Classification and regression trees, Routledge, 2017.
 (15) M. Muharram, G. D. Smith, Evolutionary constructive induction, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1518–1528.
 (16) B. Tran, B. Xue, M. Zhang, Genetic programming for feature construction and selection in classification on highdimensional data, Memetic Computing 8 (1) (2016) 3–15.
 (17) N. S. Altman, An introduction to kernel and nearestneighbor nonparametric regression, The American Statistician 46 (3) (1992) 175–185.
 (18) S. J. Russell, P. Norvig, Artificial intelligence: a modern approach, Malaysia; Pearson Education Limited,, 2016.
 (19) K. P. Murphy, Naive Bayes classifiers, University of British Columbia 18.
 (20) Q. Chen, M. Zhang, B. Xue, Genetic programming with embedded feature construction for highdimensional symbolic regression, in: Intelligent and Evolutionary Systems, Springer, 2017, pp. 87–102.
 (21) A. Cano, S. Ventura, K. J. Cios, Multiobjective genetic programming for feature extraction and data visualization, Soft Computing 21 (8) (2017) 2069–2089.
 (22) M. Virgolin, T. Alderliesten, A. Bel, C. Witteveen, P. A. N. Bosman, Symbolic regression and feature construction with GPGOMEA applied to radiotherapy dose reconstruction of childhood cancer survivors, in: Genetic and Evolutionary Computation (GECCO 2018), ACM, 2018, pp. 1395–1402.
 (23) B. Tran, B. Xue, M. Zhang, Genetic programming for multiplefeature construction on highdimensional classification, Pattern Recognition 93 (2019) 404–417.
 (24) C. Cortes, V. Vapnik, Supportvector networks, Machine learning 20 (3) (1995) 273–297.
 (25) L. Breiman, Random forests, Machine learning 45 (1) (2001) 5–32.
 (26) R. Kohavi, G. H. John, Wrappers for feature subset selection, Artificial intelligence 97 (12) (1997) 273–324.
 (27) M. Virgolin, T. Alderliesten, C. Witteveen, P. A. N. Bosman, Scalable genetic programming by genepool optimal mixing and inputspace entropybased buildingblock learning, in: Genetic and Evolutionary Computation (GECCO 2017), ACM, New York, NY, USA, 2017, pp. 1041–1048.
 (28) T. P. Pawlak, B. Wieloch, K. Krawiec, Semantic backpropagation for designing search operators in genetic programming, IEEE Transactions on Evolutionary Computation 19 (3) (2015) 326–340.
 (29) R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, A. G. Gray, MLPACK: A scalable C++ machine learning library, Journal of Machine Learning Research 14 (2013) 801–805.
 (30) C.C. Chang, C.J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 (31) M. N. Wright, A. Ziegler, ranger: A fast implementation of random forests for high dimensional data in C++ and R, arXiv preprint arXiv:1508.04409.
 (32) J. Ni, R. H. Drieberg, P. I. Rockett, The use of an analytic quotient operator in genetic programming, IEEE Transactions on Evolutionary Computation 17 (1) (2013) 146–152.
 (33) D. R. White, J. Mcdermott, M. Castelli, L. Manzoni, B. W. Goldman, G. Kronberger, W. Jaśkowski, U.M. O’Reilly, S. Luke, Better GP benchmarks: community survey results and proposals, Genetic Programming and Evolvable Machines 14 (1) (2013) 3–29.
 (34) J. Albinati, G. L. Pappa, F. E. Otero, L. O. V. Oliveira, The effect of distinct geometric semantic crossover operators in regression problems, in: European Conference on Genetic Programming, Springer, 2015, pp. 3–15.
 (35) J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
 (36) S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian journal of statistics (1979) 65–70.
 (37) A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Tech. rep., Citeseer (2009).
 (38) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
 (39) S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and intelligent laboratory systems 2 (13) (1987) 37–52.
 (40) L. v. d. Maaten, G. Hinton, Visualizing data using tsne, Journal of Machine Learning Research 9 (2008) 2579–2605.