Quantifying Inherent Randomness in Machine Learning Algorithms

Abstract

Most machine learning (ML) algorithms have several stochastic elements, and their performance is affected by these sources of randomness. This paper uses an empirical study to systematically examine the effects of two sources: randomness in model training and randomness in the partitioning of a dataset into training and test subsets. We quantify and compare the magnitude of the variation in predictive performance for the following ML algorithms: Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs). Among the different algorithms, randomness in model training causes larger variation for FFNNs than for tree-based methods. This is to be expected, as FFNNs have more stochastic elements in their model initialization and training. We also found that random splitting of datasets leads to higher variation than the inherent randomness from model training. The variation from data splitting can be a major issue if the original dataset has considerable heterogeneity.

Keywords: Model Training, Reproducibility, Variation
