Subset Selection for Multiple Linear Regression via Optimization

Abstract

Subset selection in multiple linear regression aims to choose a subset ofcandidate explanatory variables that tradeoff fitting error (explanatory power)and model complexity (number of variables selected). We build mathematicalprogramming models for regression subset selection based on mean square andabsolute errors, and minimal-redundancy-maximal-relevance criteria. Theproposed models are tested using a linear-program-based branch-and-boundalgorithm with tailored valid inequalities and big M values and are comparedagainst the algorithms in the literature. For high dimensional cases, aniterative heuristic algorithm is proposed based on the mathematical programmingmodels and a core set concept, and a randomized version of the algorithm isderived to guarantee convergence to the global optimum. From the computationalexperiments, we find that our models quickly find a quality solution while therest of the time is spent to prove optimality; the iterative algorithms findsolutions in a relatively short time and are competitive compared tostate-of-the-art algorithms; using ad-hoc big M values is not recommended.

Quick Read (beta)

loading the full paper ...