Efficient Identification of Approximate Best Configuration of Training in Large Datasets

Abstract

A training configuration is a combination of feature engineering, a learner, and the learner's associated hyperparameters. Given a set of configurations and a large dataset randomly split into training and testing sets, we study how to efficiently identify a configuration whose testing accuracy, when trained on the training set, is approximately the highest. To guarantee a small accuracy loss, we develop a solution that uses a confidence interval (CI)-based progressive sampling and pruning strategy. Compared to using the full data to find the exact best configuration, our solution achieves a speedup of more than two orders of magnitude, while the returned top configuration has identical or close test accuracy.
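The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of CI-based progressive sampling and pruning, not the authors' method. It assumes a plain Hoeffding interval on test accuracy (the paper's interval may differ, for example by also accounting for the training sample), and the names `progressive_prune`, `evaluate`, and `sample_sizes` are hypothetical. At each sample size, a configuration is dropped once its upper confidence bound falls below the best lower bound among the remaining configurations.

```python
import math


def hoeffding_half_width(n, delta):
    """Half-width of a two-sided (1 - delta) Hoeffding interval
    for the mean of n values bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def progressive_prune(configs, evaluate, sample_sizes, delta=0.05):
    """Progressively sample and prune configurations (illustrative sketch).

    configs: list of configuration identifiers (hypothetical).
    evaluate: callable (config, n) -> accuracy in [0, 1], assumed to train
        on a sample of the training set and score n sampled test examples.
    sample_sizes: non-empty, increasing sequence of sample sizes.
    Returns (best, alive): the highest-scoring survivor and all survivors.
    """
    if not sample_sizes:
        raise ValueError("sample_sizes must not be empty")
    alive = list(configs)
    acc = {}
    for n in sample_sizes:
        half = hoeffding_half_width(n, delta)
        # Re-evaluate only the configurations that have not been pruned.
        acc = {c: evaluate(c, n) for c in alive}
        best_lower = max(acc[c] - half for c in alive)
        # Keep a configuration only if its upper bound reaches the best lower bound.
        alive = [c for c in alive if acc[c] + half >= best_lower]
        if len(alive) == 1:
            break
    best = max(alive, key=lambda c: acc[c])
    return best, alive


if __name__ == "__main__":
    # Toy stand-in for real training: fixed accuracies per configuration.
    toy_accuracy = {"a": 0.90, "b": 0.70, "c": 0.89}
    best, alive = progressive_prune(
        list(toy_accuracy),
        lambda c, n: toy_accuracy[c],
        [100, 1000, 10000, 100000],
    )
    print(best, alive)
```

In this sketch, clearly inferior configurations are eliminated on small samples, so only close competitors are ever evaluated on larger ones; that is the intuition behind the speedup, under the simplifying assumptions stated above.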
