Imbalance-XGBoost: Leveraging Weighted and Focal Losses for Binary Label-Imbalanced Classification with XGBoost
Abstract
The paper presents Imbalance-XGBoost, a Python package that combines the powerful XGBoost software with weighted and focal losses to tackle binary label-imbalanced classification tasks. Though small in size, the package is, to the best of the authors' knowledge, the first of its kind to provide an integrated implementation of the two losses on XGBoost, bringing a general-purpose extension of XGBoost for label-imbalanced scenarios. In this paper, the design and usage of the package are described with exemplar code listings, and the ease of integrating it into Python-driven Machine Learning projects is illustrated. Furthermore, as the first- and second-order derivatives of the loss functions are essential for the implementations, the algebraic derivation is discussed, and it can be deemed a separate algorithmic contribution. The performance of the algorithms implemented in the package is empirically evaluated on a Parkinson's disease classification data set, and multiple state-of-the-art results have been observed. Given the scalable nature of XGBoost, the package has great potential to be applied to real-life binary classification tasks, which are usually large-scale and label-imbalanced.
A Preprint
Chen Wang (Corresponding Author)
Department of Computer Science
Rutgers University - New Brunswick
Piscataway, NJ 08854, USA
[email protected]
Chengyuan Deng
Department of Computer Science
Rutgers University - New Brunswick
Piscataway, NJ 08854, USA
[email protected]
Suzhen Wang (Corresponding Author)
Department of Health Statistics
Weifang Medical University
Weifang, Shandong 261053, China
[email protected]
January 15, 2021
Keywords: Imbalanced Classification, XGBoost, Python Package
1 Introduction
XGBoost is an advanced Gradient Tree Boosting-based software that can efficiently handle large-scale Machine Learning tasks[1]. Owing to its superior performance and affordable time and memory complexity, it has been widely applied in a variety of research fields since it was proposed, ranging from cancer diagnosis[2] and medical record analysis[3] to credit risk assessment[4] and metagenomics[5]. Also, because of its easy-to-use Python interface and explainable nature, it has become the de facto method of first choice for a majority of data science problems. For instance, in the well-known Kaggle competitions (https://www.kaggle.com/competitions), many winning teams built their models on XGBoost and expressed positive views on the method and its variations[1][6][7]. It can be tentatively predicted that XGBoost and its variations will remain among the most-applied methods in the data science community in the near future.
On the other hand, although XGBoost has achieved considerable success on both regression and classification problems, its performance is less consistent when the classification task is label-imbalanced. There have been mixed reports on the capabilities of XGBoost in handling label-imbalanced data. For example, (Zhao et al., 2018)[8] demonstrated through their experiments that XGBoost can outperform other methods on skewed data sets, while the figures in (Luo et al., 2018)[9] suggested that vanilla XGBoost must be combined with other ensembling methods to achieve satisfactory results. It is worth noting that XGBoost is not designed for label-imbalanced data, and to be fair, most 'vanilla' Machine Learning algorithms suffer a performance decline when the ratio between labels becomes biased. However, given the popularity of XGBoost and the fact that label-skewed data is, unfortunately, commonly encountered in practice, this performance decay still has significant negative effects on related research and applications.
This paper introduces Imbalance-XGBoost, an XGBoost-based Python package that addresses the label-imbalance issue in the binary label regime by implementing weighted (cross-entropy) and focal losses on the boosting machine. Weighted cross-entropy loss is one of the simplest algorithm-level cost-sensitive methods[10] for learning from imbalanced data. It follows the straightforward idea of increasing the penalty for misclassifying certain classes, and it has been widely applied to adapt vanilla machine learning algorithms to the label-imbalanced domain[11]. In contrast, focal loss[12] is a relatively novel method that originated from research in object detection. The idea of the method is to add a factor $(1-\hat{y})^{\gamma}$ to the cross-entropy function (where $\hat{y}$ is the prediction of $y$), which reduces the importance of the well-classified data points. Compared with weighted cross-entropy, focal loss enjoys a more robust parameter configuration, as the method works in our favor as long as $\gamma > 0$.
To the best of the authors' knowledge, no significant publication has previously discussed the implementation of the two losses on XGBoost. Existing studies on XGBoost under label-imbalanced scenarios usually adopt data-level approaches such as re-sampling[13] and/or cost-sensitive loss with non-trainable a priori modifications[14]. (Chen et al., 2017)[15] mentioned weighted XGBoost in their work, but details regarding the implementation are not presented. A major challenge in applying the two loss functions to XGBoost is that, to approximate the incremental learning objective, the first- and second-order derivatives of the loss function with respect to the predictions must be provided (see section 3 for more details). An algebraic contribution of this paper is therefore the derivation and implementation of the derivatives that enable the two losses to be run with XGBoost.
The package is written in Python with hard dependencies on XGBoost, Numpy[16], and Scikit-learn[17]. The losses are integrated into the XGBoost system through the customized loss framework of the software, given the derived expressions of the derivatives. Since the major methods in the program are provided by the dependencies, the core part of the package is small, with only a few hundred lines of Python code. Nevertheless, the derivations of the function derivatives, their implementation, and their significance in practical applications make the work non-trivial. The main class (containing the methods) is designed as a child class of the classes BaseEstimator and ClassifierMixin of Scikit-learn, which enables most data science methods in Scikit-learn to be applied to the corresponding object with trivial effort. The software has been released on Github (https://github.com/jhwjhw0123/Imbalance-XGBoost) and PyPI (https://pypi.org/project/imbalance-xgboost/), and it has started to attract users with a considerable number of downloads.
The rest of the paper is organized as follows. Section 2 introduces the package from the perspectives of toolkit designers and users. Section 3 provides the theoretical foundation of the second-order approximation of gradient boosting trees used in XGBoost, together with the first- and second-order derivatives of the two losses. Some related studies are surveyed and discussed in section 4, and the performance of the package is empirically examined on Parkinson's disease diagnosis data in section 5. Finally, section 6 gives a general conclusion of the paper.
2 Design and Usage of Imbalance-XGBoost
2.1 Code Design
Though the XGBoost method has implementations in multiple languages, Python is picked as the language of choice for its wide recognition and application in data science. The code follows the PEP 8 standard, and the project is open-source with the code available on the Github page. The authors strive to keep the program consistent with 'standard' practices in Python-based data science, as this makes it easier for users to get familiar with the package and to integrate it into their own projects. The input data is expected to be a Numpy array[16], but by explicitly adding an np.array() conversion, data types compatible with Numpy arrays (e.g. Pandas DataFrame[18]) will also work with the package. As the project is small, its usage can be clearly presented in the Readme file, and no additional documentation is required.
The overall program consists of three classes: one main class imbalance_xgboost, which contains the methods the users will apply, and two customized-loss classes, Weight_Binary_Cross_Entropy and Focal_Binary_Loss, on which the imbalanced losses are based. The loss functions are designed as separate classes for the convenience of parameter tuning, and they are not supposed to be called by the users. When an imbalance_xgboost object is initialized, the keyword special_objective is recorded by the __init__() method. Then, when the fit() function is executed, the corresponding loss function object is instantiated with the given parameter, and the built-in xgb.train() method of XGBoost fits the model based on the customized loss function. Figure 1 illustrates the overall structure of the program.
[Figure 1: Overall structure of the program.]
Listing 1 demonstrates a sample usage of the package to fit a dataset without parameter tuning. It can be observed from the listing that the type of XGBoost is specified during the instantiation of the object, while parameters are fed when calling the fitting function. The fitting function also has an exception handling mechanism: if the corresponding parameter ($\alpha$ or $\gamma$) is not provided for a specific type of special_objective, a ValueError is raised with the information that the essential parameter is missing.
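The dispatch and error handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's source: the class name `ImbalanceBoosterSketch`, the method `resolve_loss`, and the keyword names `imbalance_alpha` and `focal_gamma` are hypothetical stand-ins for the weighted-loss parameter $\alpha$ and the focal-loss parameter $\gamma$.

```python
class ImbalanceBoosterSketch:
    """Hypothetical stand-in that mimics the described control flow only."""

    def __init__(self, special_objective=None):
        # The loss type is fixed when the object is created.
        self.special_objective = special_objective

    def resolve_loss(self, imbalance_alpha=None, focal_gamma=None):
        """Return (loss_name, parameter) or raise if the parameter is missing."""
        if self.special_objective is None:
            # No special objective: fall back to vanilla behaviour.
            return ("vanilla", None)
        if self.special_objective == "weighted":
            if imbalance_alpha is None:
                raise ValueError("imbalance_alpha must be given for the weighted loss")
            return ("weighted", float(imbalance_alpha))
        if self.special_objective == "focal":
            if focal_gamma is None:
                raise ValueError("focal_gamma must be given for the focal loss")
            return ("focal", float(focal_gamma))
        raise ValueError("unknown special_objective: %r" % (self.special_objective,))


booster = ImbalanceBoosterSketch(special_objective="focal")
loss_name, loss_param = booster.resolve_loss(focal_gamma=2.0)
```

Checking for the missing parameter at fitting time rather than at construction matches the description that the loss type is chosen first and its parameter is supplied later.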
As stated before, the package is designed to be an estimator class of the Scikit-learn toolkit. This scheme enables the model and parameter selection methods in Scikit-learn, such as GridSearchCV() and RandomizedSearchCV(), to be applied directly to find the best parameters for imbalanced XGBoost models. In practical data science tasks, this feature is crucial, as the optimal models rely heavily on parameter tuning and selection. Also, an estimator in Scikit-learn can be combined with other estimators (transformers) by integrating them into a Pipeline object[17]. This allows weighted- and focal-XGBoost to be easily combined with pre-processing methods, such as resampling, to produce more robust results. Section 2.2 provides more details on tuning parameters and performing cross-validation with Scikit-learn.
Table 1 lists the major methods/functions of the package. They can be categorized into three groups: model fitting, model prediction, and evaluation scores. The 'basic' methods are formed by overriding functions in Scikit-learn estimators (e.g. fit()), and some methods for extensions and variations are named in a 'literal' style (e.g. predict_sigmoid()). To offer a 'downward compatible' solution, the package also allows users to call vanilla XGBoost by not specifying the objective function. The output of the predict() method is, by default, the 'raw logits' without being processed by the Sigmoid function, so the evaluation functions have been modified accordingly. Multiple evaluation functions are provided for convenience, and their details are given in section 2.3.
Table 1: Major methods and functions of the package.
Category | Function | Description |
---|---|---|
Model Fitting | fit | train the XGBoost model |
Model Prediction | predict | predict the raw logits (without Sigmoid transformation) |
Model Prediction | predict_sigmoid | predict the Sigmoid output ($\hat{y}$) |
Model Prediction | predict_determine | predict the label (0 or 1) |
Model Prediction | predict_two_classes | predict the label with one-hot encoding |
Evaluation Score | score | overriding accuracy score |
Evaluation Score | score_eval_func | flexible evaluation score with multiple metrics |
Evaluation Score | correct_eval_func | collecting prediction correctness for cross-validation |
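The relationship between the four prediction outputs in Table 1 can be sketched in a few lines. The function names below are hypothetical, and the 0.5 decision threshold and the one-hot column order [class 0, class 1] are assumptions made for illustration rather than documented behaviour of the package.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function for a single raw logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_sigmoid(logits):
    """Raw logits -> probabilities of the positive class."""
    return [sigmoid(z) for z in logits]


def to_labels(logits, threshold=0.5):
    """Raw logits -> hard labels (the 0.5 threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in to_sigmoid(logits)]


def to_one_hot(logits, threshold=0.5):
    """Raw logits -> one-hot rows, assumed column order [class 0, class 1]."""
    return [[1 - y, y] for y in to_labels(logits, threshold)]


raw_logits = [-2.0, 0.0, 3.0]
probabilities = to_sigmoid(raw_logits)
labels = to_labels(raw_logits)
one_hot = to_one_hot(raw_logits)
```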
2.2 Model Optimization and Evaluation with Scikit-Learn
Listing 2 illustrates an example of parameter tuning and cross-validation with Scikit-learn for Imbalance-XGBoost. As with common classifiers in Scikit-learn, the best classifier/parameters can be obtained through exhaustive or random search with functions such as GridSearchCV(). Notably, after fitting the model, it is possible to retrieve the 'plain' booster by accessing the member opt_booster.booster, and the object will be an instance of an XGBoost class (instead of the Imbalance-XGBoost class). This makes it possible for the user to train the model on a machine where Imbalance-XGBoost is available, save the model as 'plain' XGBoost, and run it on another machine where only the original XGBoost package is installed.
The cross-validation evaluation is, like the parameter tuning, very similar to that of an 'ordinary' classifier, and most of the usage guidelines can be found in the documentation of Scikit-learn. The only point to notice is that listing 2 combines parameter selection with cross-validation evaluation, and a new booster is instantiated after the best parameters are obtained. The optimal booster provided by GridSearchCV() is not reused because the XGBoost model should be trained from a randomized state to ensure a fair evaluation.
2.3 Built-in Evaluation Score
As one can observe in table 1, there are three evaluation functions in the package. The overriding score() function evaluates prediction accuracy given the default format of the predictions, which are pre-sigmoid values (in the range $(-\infty, +\infty)$), by wrapping the sigmoid transformation and the accuracy check together. In comparison, the function score_eval_func() returns metrics other than accuracy. In label-imbalanced binary classification, accuracy cannot reliably reveal the quality of the performance on its own, as the metric can be 'tricked' by predicting all the instances as the majority class. This type of prediction leads to high accuracy, yet the classifier actually does nothing. Thus, metrics that take 'preciseness' into account, such as precision, recall, $F_1$ score and the Matthews Correlation Coefficient (MCC)[19], are often applied in this scenario[20]. The function score_eval_func() provides implementations of all the metrics mentioned above, and the metric is selected by fixing the argument 'mode' (which can be accomplished with functools.partial()).
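The way accuracy can be 'tricked' is easy to reproduce numerically. The sketch below is not the package's score_eval_func(); the helper names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return 0.0 when the denominator is zero (a convention, not a law)."""
    return num / den if den else 0.0


def metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# 90 negatives and 10 positives; the "classifier" always predicts the majority.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
result = metrics(y_true, y_pred)
```

Here the accuracy is 0.9 while recall, $F_1$ score and MCC are all 0, which is the failure mode the alternative metrics are meant to expose.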
In the case of leave-one-out/leave-few-out cross-validation, any metric other than accuracy will likely become ill-defined. For example, for the precision metric in leave-one-out cross-validation, if the prediction is 0 for the single held-out instance, it is meaningless to compute the 'precision'. In such a situation, one will instead wish to collect the classification correctness of each prediction, sum up the evaluations, and compute the metrics from the confusion matrix built over every 'test instance'. To make this possible, the package provides the function correct_eval_func(). Its behavior is selected by the 'mode' argument, and the four choices TP, FP, TN and FN represent True Positive, False Positive, True Negative and False Negative, respectively. Note that the four modes should be used together to produce a complete confusion matrix; a wrapper that combines them into one function is a possible future extension of the package.
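The pooling idea can be sketched as below. The function `correctness` is a hypothetical analogue of a mode-selected correctness function, and the list of held-out results is made up for illustration.

```python
def correctness(y_true, y_pred, mode):
    """Return 1 if the single prediction falls in the cell named by mode."""
    cells = {
        "TP": (1, 1),
        "FP": (0, 1),
        "TN": (0, 0),
        "FN": (1, 0),
    }
    return int((y_true, y_pred) == cells[mode])


# Hypothetical held-out results: one (true label, predicted label) per fold.
folds = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1), (1, 1)]

totals = {mode: 0 for mode in ("TP", "FP", "TN", "FN")}
for y_true, y_pred in folds:
    for mode in totals:
        totals[mode] += correctness(y_true, y_pred, mode)

# Metrics are computed once, from the pooled confusion matrix.
precision = totals["TP"] / (totals["TP"] + totals["FP"])
recall = totals["TP"] / (totals["TP"] + totals["FN"])
f1 = 2 * precision * recall / (precision + recall)
```

Computing precision on a single fold such as `(1, 0)` would divide by zero, whereas the pooled counts always give a well-defined value as long as at least one positive prediction and one positive label occur overall.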
3 Theories and Derivatives
In this section, the mathematical foundations and derivations of the loss functions are discussed. At a high level, since XGBoost adopts an additive learning scheme with a second-order approximation, the first-order derivative (shortened to 'gradient') and the second-order derivative (denoted as 'hessian', although this is somewhat of a misnomer) of the loss functions with respect to the prediction are required for fitting the model. To make the mechanism clear, the section first reviews the second-order approximation of additive tree boosting in section 3.1. Subsequently, the gradients and hessians of the weighted and focal losses are derived in sections 3.2 and 3.3, respectively.
The notations used in this section are as follows. We use $m$ to denote the number of data points and $n$ the number of features. The 'raw prediction' before the sigmoid function is denoted as $z$, and the probabilistic prediction is $\hat{y} = \sigma(z)$, where $\sigma(\cdot)$ represents the sigmoid function. It is important to keep in mind that there is a discrepancy between the notations of this paper and the original XGBoost literature[1], as the $\hat{y}$ in their analysis is denoted as $z$ here. $y$ denotes the true label, and $\alpha$ and $\gamma$ are the parameters of the weighted and focal loss functions, respectively. The expressions of the gradients/hessians are written in a merged format independent of the value of $y$, as this simplifies the program implementation and helps vectorization in other related programs.
3.1 Second-order Approximation of Gradient Boosting Tree
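According to [1], the additive learning objective used in practice is:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{m} L\left(y_i,\; z_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t) \tag{1}$$
where $t$ denotes the $t$-th iteration of the training process, $f_t$ is the tree added in that iteration, and $\Omega$ is the regularization term. Notice that the replacement of notation ($z$ in place of the $\hat{y}$ of [1]) has been applied in the equation. Applying a second-order Taylor expansion to equation 1, one gets:
$$\begin{aligned} \mathcal{L}^{(t)} &\approx \sum_{i=1}^{m}\left[L\left(y_i, z_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i)\right] + \Omega(f_t) \\ \tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^{m}\left[g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i)\right] + \Omega(f_t) \end{aligned} \tag{2}$$
The last line comes from the fact that the term $L(y_i, z_i^{(t-1)})$ can be removed from the learning objective, as it is unrelated to the fitting of the model in the $t$-th iteration. In equation 2, $g_i = \partial L / \partial z_i^{(t-1)}$ and $h_i = \partial^{2} L / \partial (z_i^{(t-1)})^{2}$ are the 'gradient' and 'hessian' terms mentioned before. Notice that both $g_i$ and $h_i$ are scalars, as individual boosting trees only deal with binary problems. Multi-class classification tasks are usually processed by an ensemble of binary classification trees (the so-called one-vs-all scheme)[21][22]. This is also the reason why the authors consider the two terms to be somewhat of a misnomer.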
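To make the role of the 'gradient' and 'hessian' concrete, the sketch below minimizes the second-order objective for the simplest possible tree: a single leaf that adds the same weight $w$ to every sample. It assumes the regularizer contributes $\frac{1}{2}\lambda w^{2}$ for that leaf, uses the ordinary cross-entropy derivatives $g_i = \hat{y}_i - y_i$ and $h_i = \hat{y}_i(1-\hat{y}_i)$, and picks $\lambda = 1$ arbitrarily. Setting the derivative of the quadratic to zero gives $w^{*} = -\sum_i g_i / (\sum_i h_i + \lambda)$.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_hess_logistic(z, y):
    """Gradient and 'hessian' of cross-entropy w.r.t. the raw prediction z."""
    p = sigmoid(z)
    return p - y, p * (1.0 - p)


def surrogate(w, grads, hessians, lam):
    """Second-order objective for a single leaf with weight w."""
    data_term = sum(g * w + 0.5 * h * w * w for g, h in zip(grads, hessians))
    return data_term + 0.5 * lam * w * w


# Previous raw predictions z^(t-1) and labels for four samples in one leaf.
z_prev = [0.0, 0.0, 0.0, 0.0]
labels = [1, 1, 1, 0]
lam = 1.0  # assumed L2 regularization strength

pairs = [grad_hess_logistic(z, y) for z, y in zip(z_prev, labels)]
grads = [g for g, _ in pairs]
hessians = [h for _, h in pairs]

# Minimizer of the quadratic surrogate: w* = -G / (H + lambda).
w_star = -sum(grads) / (sum(hessians) + lam)
```

Only $g_i$ and $h_i$ enter the computation, which is why a custom loss can be plugged in by supplying these two quantities.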
Since XGBoost does not provide automatic differentiation, hand-derived derivatives are essential. Meanwhile, the derived expressions have further potential to be applied to other machine learning tasks. Therefore, the derivatives are discussed in sections 3.2 and 3.3. For both loss functions, sigmoid is selected as the activation, and the following basic property of sigmoid is used consistently in the derivations:
$$\frac{\partial \hat{y}}{\partial z} = \frac{\partial \sigma(z)}{\partial z} = \sigma(z)\left(1 - \sigma(z)\right) = \hat{y}\left(1 - \hat{y}\right) \tag{3}$$
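The sigmoid property in equation 3, $\partial\hat{y}/\partial z = \hat{y}(1-\hat{y})$, can be checked numerically with a central finite difference. The sketch below is only a sanity check; the step size and the test points are arbitrary choices.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_derivative(z):
    """Closed form from equation 3: sigma(z) * (1 - sigma(z))."""
    p = sigmoid(z)
    return p * (1.0 - p)


def central_difference(f, z, eps=1e-6):
    """Numerical derivative used only as an independent check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


max_gap = max(
    abs(sigmoid_derivative(z) - central_difference(sigmoid, z))
    for z in (-5.0, -1.0, 0.0, 0.5, 4.0)
)
```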
3.2 Weighted Cross-entropy Loss
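The weighted cross-entropy loss for binary classification can be denoted as follows:
$$L_w = -\sum_{i=1}^{m}\left[\alpha\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right] \tag{4}$$
where $\alpha$ indicates the 'imbalance parameter'. Intuitively, if $\alpha$ is greater than 1, extra loss is counted on 'classifying 1 as 0'; on the other hand, if $\alpha$ is less than 1, the loss function puts relatively more weight on whether data points with label 0 are correctly identified.
The first-order derivative is presented as follows:
$$\frac{\partial L_w}{\partial z_i} = -\alpha^{y_i}\left(y_i - \hat{y}_i\right) \tag{5}$$
The derivative is similar to that of the ordinary cross-entropy loss. A notable difference is that a term $\alpha^{y_i}$ is added to control the presence of the parameter: it equals $\alpha$ when $y_i = 1$ and 1 when $y_i = 0$.
Taking the derivative with respect to $z_i$ again and plugging in equation 3, one gets the second-order derivative:
$$\frac{\partial^{2} L_w}{\partial z_i^{2}} = \alpha^{y_i}\,\hat{y}_i\left(1 - \hat{y}_i\right) \tag{6}$$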
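The merged expressions for the weighted loss, gradient $-\alpha^{y}(y-\hat{y})$ and hessian $\alpha^{y}\hat{y}(1-\hat{y})$, can be verified against finite differences of the per-sample loss. The sketch below is a scalar check written for clarity rather than speed; the function names, the value $\alpha = 2$ and the test points are illustrative choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_loss(z, y, alpha):
    """Per-sample weighted cross-entropy (equation 4 without the sum)."""
    p = sigmoid(z)
    return -(alpha * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def weighted_grad_hess(z, y, alpha):
    """Merged-form first and second derivatives with respect to z."""
    p = sigmoid(z)
    weight = alpha ** y  # alpha for y = 1, 1 for y = 0
    return -weight * (y - p), weight * p * (1.0 - p)


def numeric_grad_hess(z, y, alpha, eps=1e-4):
    """Central finite differences, used only to check the closed forms."""
    f0 = weighted_loss(z, y, alpha)
    fp = weighted_loss(z + eps, y, alpha)
    fm = weighted_loss(z - eps, y, alpha)
    return (fp - fm) / (2.0 * eps), (fp - 2.0 * f0 + fm) / (eps * eps)


max_gap = 0.0
for y in (0, 1):
    for z in (-2.0, 0.3, 1.5):
        g, h = weighted_grad_hess(z, y, alpha=2.0)
        g_num, h_num = numeric_grad_hess(z, y, alpha=2.0)
        max_gap = max(max_gap, abs(g - g_num), abs(h - h_num))
```

Because the factor `alpha ** y` handles both label values, the same expression applies to every sample, which is what makes the merged form convenient to vectorize.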
3.3 Focal Loss
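According to [12], the binary focal loss can be denoted as:
$$L_f = -\sum_{i=1}^{m}\left[y_i\left(1 - \hat{y}_i\right)^{\gamma}\log(\hat{y}_i) + (1 - y_i)\,\hat{y}_i^{\gamma}\log(1 - \hat{y}_i)\right] \tag{7}$$
As one can observe, if one sets $\gamma = 0$, the equation becomes the ordinary cross-entropy loss. Taking equation 3 into consideration, the first-order derivative of the focal loss can be denoted as:
$$\frac{\partial L_f}{\partial z_i} = \gamma\left[\left(\hat{y}_i + y_i - 1\right)\left(y_i + (-1)^{y_i}\hat{y}_i\right)^{\gamma}\log\left(1 - y_i - (-1)^{y_i}\hat{y}_i\right)\right] + (-1)^{y_i}\left(y_i + (-1)^{y_i}\hat{y}_i\right)^{\gamma+1} \tag{8}$$
If $\gamma$ is set to 0 in equation 8, the derivative reduces to $\hat{y}_i - y_i$, the same as for the cross-entropy loss. The equation follows a clear structure, but it is still lengthy. To simplify the expression, one can set the following shorthand variables:
$$\begin{cases} \eta_1 = \hat{y}_i\left(1 - \hat{y}_i\right) \\ \eta_2 = y_i + (-1)^{y_i}\hat{y}_i \\ \eta_3 = \hat{y}_i + y_i - 1 \\ \eta_4 = 1 - y_i - (-1)^{y_i}\hat{y}_i \end{cases} \tag{9}$$
Plugging these representations into equation 8, the first-order derivative can be denoted as:
$$\frac{\partial L_f}{\partial z_i} = \gamma\,\eta_3\,\eta_2^{\gamma}\log(\eta_4) + (-1)^{y_i}\eta_2^{\gamma+1} \tag{10}$$
Finally, taking the derivative with respect to $z_i$ and combining equations 3 and 10, one gets the second-order derivative ('hessian'), which can be denoted as:
$$\frac{\partial^{2} L_f}{\partial z_i^{2}} = \eta_1\left\{\gamma\left[\left(\eta_2^{\gamma} + \gamma\,(-1)^{y_i}\eta_3\,\eta_2^{\gamma-1}\right)\log(\eta_4) - \frac{(-1)^{y_i}\eta_3\,\eta_2^{\gamma}}{\eta_4}\right] + (\gamma + 1)\,\eta_2^{\gamma}\right\} \tag{11}$$
Again, if $\gamma = 0$, the second-order derivative becomes $\hat{y}_i(1 - \hat{y}_i)$, which matches the formula of the second-order derivative of ordinary cross-entropy.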
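The focal-loss derivatives are long enough that a numerical check is worthwhile. The sketch below writes the merged-form gradient and hessian with four shorthand quantities (named `eta1` to `eta4` here; the names are an assumption of this sketch), compares them with finite differences of the per-sample focal loss, and confirms that $\gamma = 0$ recovers the cross-entropy derivatives. It is a scalar illustration, not the package's implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def focal_loss(z, y, gamma):
    """Per-sample binary focal loss (equation 7 without the sum)."""
    p = sigmoid(z)
    return -(y * (1.0 - p) ** gamma * math.log(p)
             + (1 - y) * p ** gamma * math.log(1.0 - p))


def focal_grad_hess(z, y, gamma):
    """Merged-form first and second derivatives with respect to z."""
    p = sigmoid(z)
    sign = (-1) ** y          # -1 for y = 1, +1 for y = 0
    eta1 = p * (1.0 - p)      # sigmoid derivative
    eta2 = y + sign * p       # (1 - p) for y = 1, p for y = 0
    eta3 = p + y - 1
    eta4 = 1 - y - sign * p   # p for y = 1, (1 - p) for y = 0
    log4 = math.log(eta4)
    grad = gamma * eta3 * eta2 ** gamma * log4 + sign * eta2 ** (gamma + 1)
    bracket = ((eta2 ** gamma + gamma * sign * eta3 * eta2 ** (gamma - 1)) * log4
               - sign * eta3 * eta2 ** gamma / eta4)
    hess = eta1 * (gamma * bracket + (gamma + 1) * eta2 ** gamma)
    return grad, hess


def numeric_grad_hess(z, y, gamma, eps=1e-4):
    """Central finite differences, used only to check the closed forms."""
    f0 = focal_loss(z, y, gamma)
    fp = focal_loss(z + eps, y, gamma)
    fm = focal_loss(z - eps, y, gamma)
    return (fp - fm) / (2.0 * eps), (fp - 2.0 * f0 + fm) / (eps * eps)


max_gap = 0.0
for y in (0, 1):
    for z in (-2.0, 0.3, 1.5):
        g, h = focal_grad_hess(z, y, gamma=2.0)
        g_num, h_num = numeric_grad_hess(z, y, gamma=2.0)
        max_gap = max(max_gap, abs(g - g_num), abs(h - h_num))
```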
4 Related Work
The paper is built on the foundation of the original papers of XGBoost[1] and focal loss[12], and the methodology for programming a customized loss function is provided on the software's Github page (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py). XGBoost is based on the algorithm of gradient tree boosting[23], which had been regarded as a powerful Machine Learning technique long before XGBoost was born[24]. Besides XGBoost, there are other implementations of gradient boosting, such as pGBRT[25], LightGBM[26], and CatBoost[27]. Some of these implementations have additional features and are able to outperform XGBoost on specific problems, but XGBoost remains the method of first choice in the data science community at large. As for the recently proposed focal loss, studies related to it are usually affiliated with Neural Networks and Deep Learning[28][29][30]. The loss function is usually applied in an end-to-end manner with automatic differentiation, and to the best of the authors' knowledge, there has not been any notable publication comprehensively discussing the derivatives of the loss function (although the first-order derivative was briefly discussed and presented in another form in the original paper[12]).
Previous applications of XGBoost in label-imbalanced scenarios focus mostly on data-level algorithms. For example, (Kabir et al., 2018)[13] applied several commonly used data resampling methods before using XGBoost for label-imbalanced breast cancer classification, and (He et al., 2018)[31] utilized a more advanced under-sampling method called BalanceCascade[32] with XGBoost for credit scoring. Among the limited number of publications discussing algorithm-level modifications of XGBoost for imbalanced classification, (Xia et al., 2017)[14] used an a priori modification of the sigmoid activation to achieve a better result, but the loss function was unchanged. As mentioned in section 1, (Chen et al., 2017)[15] is, to the best of the authors' knowledge, so far the only work that explicitly applied a weighted loss function to XGBoost. It is worth noting that a Tensorflow-based gradient boosting implementation called TF Boosted Trees[33] is able to run with the loss functions without the derivatives provided in this paper, as it has an automatic differentiation mechanism. Nevertheless, it is a less popular package without support for large-scale Machine Learning or compatibility with the Scikit-learn toolkit.
As an issue frequently encountered in practice, label-imbalanced classification has been intensively studied by researchers, and there are multiple existing software programs designed to handle the problem. For example, (Lemaitre et al., 2017)[34] provides an integrated Python package called Imbalanced-learn for data-level resampling in imbalanced classification, and it has counterparts in other programming languages, such as ROSE in R[35]. It is worth noting that the Imbalanced-learn package can be considered an extension of Scikit-learn, and the Machine Learning toolkit itself also provides elementary methods to deal with label-imbalanced problems[17]. Other software programs concerning label-imbalanced classification include popular Data Mining toolkits, such as KEEL[36] and WEKA[37]. In addition, (Zhang et al., 2019)[38] provides software containing a set of algorithms specifically for multi-class label-imbalanced problems, serving as one of the most recent studies on this topic.
5 Experiments
In this section, experimental results based on Parkinson's disease classification data are discussed. The results suggest that the special XGBoost methods implemented in the package outperform the best existing approaches known to the authors on the same task, and that the pattern of the predictions meets expectations. The dataset and experimental setup are discussed in section 5.1, and the results and discussion are presented in section 5.2.
5.1 Dataset and Setup
First, the Parkinson's Disease (PD) classification data[39] (publicly available at https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification) is introduced. A recently collected dataset with 757 features categorized into 7 specific groups (two originally separate groups, Bandwidth and Formant, are merged in the experiments), the data was gathered from 188 Parkinson's disease patients and 64 healthy individuals at the Department of Neurology in Cerrahpaşa Faculty of Medicine, Istanbul University[39]. Each individual corresponds to 3 records, and because of the difference in the number of participants between the two groups, a label imbalance ratio of 188:64 (roughly 3:1) emerges.
As the dataset is new, the best known classification results are those demonstrated in [39] with seven traditional classification algorithms and two ensemble approaches. It is worth noting that the figures reported in [39] are weaker than the best existing performances concerning Parkinson's disease. The authors of that paper explained that since the experiments were conducted with leave-one-object-out cross-validation, the classification task becomes more challenging, as no information from the same object (person) appears in the training data (in contrast to leave-one-record-out). To stay consistent with the original system, the same cross-validation technique is applied in the experiments of this paper. Furthermore, the results in [39] show a high accuracy and a relatively lower $F_1$ score, indicating that the classifiers failed to separate the two classes clearly and likely achieved their performance by overwhelmingly predicting the majority class. This is an unfavourable behavior in label-imbalanced classification, and as one will see in the following sections, one advantage of Imbalance-XGBoost is that it does not suffer from this problem.
As mentioned, the parameters $\alpha$ and $\gamma$ affect the performance of weighted and focal loss, and a parameter search is often deemed necessary in Machine Learning models. Therefore, in our experiments, grid search is applied through GridSearchCV() of Scikit-learn to explore the optimal models. The search range of $\alpha$ is set to […] and the parameter $\gamma$ is selected from the candidates […]. Notice that $\alpha$ is set to less than 1 in the experiments, since the data points with label '1' (patients) form the majority class of the dataset. To conduct the leave-one-object-out cross-validation, the correctness collection function mentioned in section 2.3 is applied. By collecting the counts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), the confusion matrix is obtained, and accuracy and $F_1$ score are computed accordingly. The records are evaluated in a per-record manner, which means the 3 records of one object (patient/healthy individual) are evaluated individually, and 3 correctness counts are added.
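The leave-one-object-out protocol with per-record counting can be sketched as follows. The splitter `leave_one_subject_out`, the subject identifiers, the labels and the `predictions` list are all hypothetical; in a real run each prediction would come from a model fitted on the training indices of the corresponding fold.

```python
def leave_one_subject_out(subject_ids):
    """Yield (train_indices, test_indices), holding out one subject at a time."""
    seen = []
    for sid in subject_ids:
        if sid not in seen:
            seen.append(sid)  # keep first-appearance order
    for held_out in seen:
        test = [i for i, sid in enumerate(subject_ids) if sid == held_out]
        train = [i for i, sid in enumerate(subject_ids) if sid != held_out]
        yield train, test


# Toy layout: 4 subjects with 3 records each (subject ids are hypothetical).
subject_ids = [s for s in ("s1", "s2", "s3", "s4") for _ in range(3)]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# Placeholder predictions standing in for the output of a fitted model.
predictions = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1]

folds = list(leave_one_subject_out(subject_ids))

# Per-record evaluation: every held-out record contributes one count.
counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for _, test_idx in folds:
    for i in test_idx:
        if predictions[i] == 1:
            counts["TP" if labels[i] == 1 else "FP"] += 1
        else:
            counts["FN" if labels[i] == 1 else "TN"] += 1
```

Splitting by subject rather than by record keeps all records of the held-out person out of the training set, which is the property that makes this protocol harder than leave-one-record-out.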
5.2 Classification Results and Discussion
Accuracy and $F_1$ score on the test set with 6 sets of features are presented in Table 2, where 'Best in [39]' indicates the best accuracy and $F_1$ score retrieved from that paper.
Table 2: Accuracy and $F_1$ score on the six feature groups.
Method | Baseline features: Accuracy | Baseline features: $F_1$ score | MFCC: Accuracy | MFCC: $F_1$ score | Wavelet features: Accuracy | Wavelet features: $F_1$ score |
---|---|---|---|---|---|---|
Best in [39] | 0.79 | 0.75 | 0.84 | 0.83 | 0.78 | 0.74 |
Weighted-XGBoost | 0.76 | 0.85 | 0.80 | 0.87 | 0.75 | 0.85 |
Focal-XGBoost | 0.76 | 0.85 | 0.82 | 0.89 | 0.75 | 0.85 |

Method | Bandwidth + Formant: Accuracy | Bandwidth + Formant: $F_1$ score | Intensity Based: Accuracy | Intensity Based: $F_1$ score | Vocal Fold-Based: Accuracy | Vocal Fold-Based: $F_1$ score |
---|---|---|---|---|---|---|
Best in [39] | 0.77 | 0.72 | 0.77 | 0.74 | 0.77 | 0.74 |
Weighted-XGBoost | 0.74 | 0.85 | 0.75 | 0.85 | 0.75 | 0.84 |
Focal-XGBoost | 0.75 | 0.85 | 0.75 | 0.85 | 0.76 | 0.85 |
Without exception, a slight decline in accuracy can be observed for weighted-XGBoost and focal-XGBoost, but both classifiers produce a considerably higher $F_1$ score. The increase in $F_1$ score together with the decrease in accuracy suggests that the previously obtained higher accuracy is a consequence of overlooking the minority class, so it is reasonable for our classifiers to appear to 'sacrifice' accuracy in order to provide impartial recognition results on both classes. Furthermore, focal-XGBoost attains the highest $F_1$ score, or ties for it, in every feature group. This observation can be explained from an algorithmic perspective: focal loss is more robust to its parameter, while weighted loss is prone to the effect of sub-optimal parameters even when a parameter search is applied.
To eliminate potential impacts on the classification performance due to intrinsic characteristics of individual sets of features, a classifier with the 50 top-ranked features selected by mRMR (minimum Redundancy-Maximum Relevance)[40] was applied in [39] as well. The feature selection method is based on the principle of maximizing the joint dependency of the top-ranking variables on the target variable while reducing the redundancy among them[40][41]. For comparison, this paper employs the same technique through an available Python interface (https://github.com/fbrundu/pymrmr) and produces a subset of the top-50 features to run with Imbalance-XGBoost. The performance of weighted- and focal-XGBoost on the top-50 features can be observed in table 3.
Table 3: Accuracy and $F_1$ score on the top-50 features selected by mRMR.
Method | Accuracy | $F_1$ score |
---|---|---|
Best in [39] | 0.86 | 0.84 |
Weighted-XGBoost | 0.82 | 0.88 |
Focal-XGBoost | 0.83 | 0.89 |
Consistent with the performance on the individual groups of features, the focal-XGBoost classifier has the highest $F_1$ score, slightly better than weighted-XGBoost. Both weighted- and focal-XGBoost outperform the best classifier in [39] in terms of $F_1$ score by a clear margin (0.88 and 0.89 against 0.84), while their accuracy is slightly lower. Since the top-50 features can be regarded as a 'master subset', this result further corroborates the advantage of the methods implemented in Imbalance-XGBoost on label-imbalanced data.
6 Conclusion
This paper presents a novel Python-based package, namely Imbalance-XGBoost, for binary label-imbalanced classification with XGBoost. The package implemented weighted cross-entropy and focal loss functions on XGBoost, and it is fully compatible with the popular Scikit-learn package in Python. The design and usage of the package are introduced, and the discussion of methods and code listing examples provide a clear and comprehensive user guidance of the package. The theories and derivatives essential to the package are further discussed, and experiments based on Parkinson’s disease classification data are conducted with state-of-the-art performances illustrated. Overall, the package demonstrated in this paper successfully combines XGBoost with popular label-imbalance-robust loss functions and provides one of the most competitive performances up to date.
In summary, this paper has made three main contributions. Firstly, the paper has introduced a novel package that leverages the power of weighted and focal loss functions for XGBoost, and it has great potential to be applied to a variety of real-life binary classification problems. Secondly, the paper has studied the theoretical foundations of the second-order approximation of XGBoost and has provided the essential derivatives for the loss functions to be applied. The derivatives can also be applied to other fields in Machine Learning, and the equations in the merged form are convenient to vectorize. Finally, the paper has offered new state-of-the-art performances on the Parkinson’s disease classification data, and the emphasis on its imbalanced nature provides a new perspective for studying the dataset.
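As a brief illustration of what such a merged form of the derivatives looks like, the sketch below writes the first- and second-order derivatives of a weighted cross-entropy with respect to the raw score z. It assumes the loss places a weight alpha on the positive-class term only and that the prediction is sigmoid(z); the function names are hypothetical and this is not the package's own code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_ce_loss(z, y, alpha):
    """Weighted cross-entropy for one sample: raw score z, label y in {0, 1}.

    Assumed form: -(alpha * y * log(p) + (1 - y) * log(1 - p)), p = sigmoid(z).
    """
    p = sigmoid(z)
    return -(alpha * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def weighted_ce_grad_hess(z, y, alpha):
    """First and second derivatives of the loss above with respect to z.

    A single expression covers both classes because alpha ** y equals
    alpha for y = 1 and 1 for y = 0.
    """
    p = sigmoid(z)
    weight = alpha ** y
    grad = weight * (p - y)
    hess = weight * p * (1.0 - p)
    return grad, hess


# Example: a positive sample with raw score 0.3 and alpha = 2
print(weighted_ce_grad_hess(0.3, 1, 2.0))
```

Because the same expression holds for both labels, it applies element-wise to whole arrays of scores and labels without branching, which is what makes this form easy to vectorize.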
In the future, the authors plan to keep maintaining the package and to improve its quality by adding new features and further optimizing the code. The software is open-source, and every member of the community is welcome to contribute their own revisions of the program. Furthermore, the authors intend to add more evaluation score functions in upcoming versions. An extension to multi-class label-imbalanced classification problems is a longer-term plan.
Acknowledgment
The authors would like to thank the GitHub users icegrid and shaojunchao for reporting and correcting errors in previous versions, and the GitHub users olivierverdier and braingineer for providing LaTeX solutions for code listings. The authors would also like to thank Noel Chao of Dow Jones for providing writing suggestions and proofreading the paper.
References
- [1] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
- [2] Ching-Wei Wang, Yu-Ching Lee, Evelyne Calista, Fan Zhou, Hongtu Zhu, Ryohei Suzuki, Daisuke Komura, Shumpei Ishikawa, and Shih-Ping Cheng. A benchmark for comparing precision medicine methods in thyroid cancer diagnosis using tissue microarrays. Bioinformatics, 34(10):1767–1773, 2017.
- [3] Chen Wang, Suzhen Wang, Fuyan Shi, and Zaixiang Wang. Robust propensity score computation method based on machine learning with label-corrupted data. arXiv preprint arXiv:1801.03132, 2018.
- [4] Yung-Chia Chang, Kuei-Hu Chang, and Guan-Jhih Wu. Application of extreme gradient boosting trees in the construction of credit risk assessment models for financial institutions. Applied Soft Computing, 73:914–920, 2018.
- [5] Jyotsna Talreja Wassan, Haiying Wang, Fiona Browne, and Huiru Zheng. A comprehensive study on predicting functional role of metagenomes using machine learning methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 16(3):751–763, 2019.
- [6] Kamil Belkhayat Abou Omar. XGBoost and LGBM for Porto Seguro’s Kaggle challenge: A comparison. Preprint Semester Project, 2018.
- [7] Didrik Nielsen. Tree boosting with XGBoost - why does XGBoost win "every" machine learning competition? Master’s thesis, NTNU, 2016.
- [8] Zhixun Zhao, Hui Peng, Chaowang Lan, Yi Zheng, Liang Fang, and Jinyan Li. Imbalance learning for the prediction of N6-methylation sites in mRNAs. BMC Genomics, 19(1):574, 2018.
- [9] Ruisen Luo, Songyi Dian, Chen Wang, Peng Cheng, Zuodong Tang, YanMei Yu, and Shixiong Wang. Bagging of XGBoost classifiers with random under-sampling and Tomek link for noisy label-imbalanced data. In IOP Conference Series: 3rd International Conference on Automation, Control and Robotics Engineering (CACRE 2018), volume 428, page 012004. IOP Publishing, 2018.
- [10] Yanmin Sun, Andrew KC Wong, and Mohamed S Kamel. Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04):687–719, 2009.
- [11] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.
- [12] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
- [13] Md Faisal Kabir and Simone Ludwig. Classification of breast cancer risk factors using several resampling approaches. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1243–1248. IEEE, 2018.
- [14] Yufei Xia, Chuanzhe Liu, and Nana Liu. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending. Electronic Commerce Research and Applications, 24:30–49, 2017.
- [15] Wenbin Chen, Kun Fu, Jiawei Zuo, Xinwei Zheng, Tinglei Huang, and Wenjuan Ren. Radar emitter classification for large data set based on weighted-XGBoost. IET Radar, Sonar & Navigation, 11(8):1203–1207, 2017.
- [16] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.
- [17] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
- [18] Wes McKinney. pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, 14, 2011.
- [19] Brian W Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975.
- [20] David Martin Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2011.
- [21] Erin L Allwein, Robert E Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1(Dec):113–141, 2000.
- [22] Günther Eibl and Karl-Peter Pfeiffer. Multiclass boosting for weak classifiers. Journal of Machine Learning Research, 6(Feb):189–210, 2005.
- [23] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
- [24] Alexey Natekin and Alois Knoll. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7:21, 2013.
- [25] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387–396. ACM, 2011.
- [26] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.
- [27] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.
- [28] Xiaoliang Wang, Peng Cheng, Xinchuan Liu, and Benedict Uzochukwu. Focal loss dense detector for vehicle surveillance. In 2018 International Conference on Intelligent Systems and Computer Vision (ISCV), pages 1–5. IEEE, 2018.
- [29] Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. DeepGlobe 2018: A challenge to parse the earth through satellite images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 172–181. IEEE, 2018.
- [30] Tianqi Zhang, Li-Ying Hao, and Ge Guo. A feature enriching object detection framework with weak segmentation loss. Neurocomputing, 335:72–80, 2019.
- [31] Hongliang He, Wenyu Zhang, and Shuai Zhang. A novel ensemble method for credit scoring: Adaption of different imbalance ratios. Expert Systems with Applications, 98:105–117, 2018.
- [32] Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2008.
- [33] Natalia Ponomareva, Soroush Radpour, Gilbert Hendry, Salem Haykal, Thomas Colthurst, Petr Mitrichev, and Alexander Grushetsky. TF Boosted Trees: A scalable TensorFlow based framework for gradient boosting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 423–427. Springer, 2017.
- [34] Guillaume Lemaître, Fernando Nogueira, and Christos K Aridas. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1):559–563, 2017.
- [35] Nicola Lunardon, Giovanna Menardi, and Nicola Torelli. ROSE: A package for binary imbalanced learning. The R Journal, 6(1), 2014.
- [36] Jesús Alcalá-Fdez, Alberto Fernández, Julián Luengo, Joaquín Derrac, Salvador García, Luciano Sánchez, and Francisco Herrera. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17, 2011.
- [37] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
- [38] Chongsheng Zhang, Jingjun Bi, Shixin Xu, Enislay Ramentol, Gaojuan Fan, Baojun Qiao, and Hamido Fujita. Multi-imbalance: An open-source software for multi-class imbalance learning. Knowledge-Based Systems, 174:137–143, 2019.
- [39] C Okan Sakar, Gorkem Serbes, Aysegul Gunduz, Hunkar C Tunc, Hatice Nizam, Betul Erdogdu Sakar, Melih Tutuncu, Tarkan Aydin, M Erdem Isenkul, and Hulya Apaydin. A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing, 74:255–263, 2019.
- [40] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(8):1226–1238, 2005.
- [41] Salvador García, Julián Luengo, and Francisco Herrera. Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowledge-Based Systems, 98:1–29, 2016.