(103g) Subset Selection in Multiple Linear Regression
AIChE Annual Meeting
2014
Computing and Systems Technology Division
Advances in Data Analysis: Theory and Applications
Monday, November 17, 2014 - 2:24pm to 2:43pm
The ALAMO methodology has been developed recently to address the problem of discovering simple algebraic models from data obtained from simulations or experiments [1]. An important part of building these surrogate models is selecting the best subset from a large number of linear and nonlinear functions of explanatory variables. The selected subset will balance model complexity with the goodness-of-fit of the model in order to uncover underlying physical relationships instead of overfitting to the noise in the data.
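To illustrate the complexity-versus-fit trade-off, the following is a minimal sketch of exhaustive best-subset selection over a small candidate basis, scored here with the Bayesian information criterion (BIC). This is an illustrative toy, not the ALAMO implementation; the basis functions, data, and scoring choice are assumptions for the example.

```python
# Illustrative sketch only (not the ALAMO method): enumerate all nonempty
# subsets of candidate basis functions and keep the one with the best BIC,
# which penalizes model complexity against goodness-of-fit.
from itertools import combinations
import numpy as np

def bic(y, y_hat, k):
    """BIC for a k-term least-squares fit (Gaussian errors assumed)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def best_subset(X, y):
    """Enumerate all nonempty column subsets of X; return (score, columns)."""
    n, p = X.shape
    best = (np.inf, None)
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            score = bic(y, X[:, cols] @ beta, k)
            if score < best[0]:
                best = (score, cols)
    return best

# Toy data: the response depends only on x and x**2; the BIC penalty
# should discourage the superfluous sin(x) and exp(x) terms.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
X = np.column_stack([x, x**2, np.sin(x), np.exp(x)])  # candidate basis
y = 3.0 * x + 1.5 * x**2 + 0.05 * rng.standard_normal(200)
score, cols = best_subset(X, y)
print(cols)
```

Exhaustive enumeration is exact but scales as 2^p, which is why the heuristic and optimization-based alternatives discussed below matter in practice.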
The purpose of this paper is to present a systematic analysis of new and existing approaches to the subset selection problem encountered in ALAMO. The same subset selection problem arises naturally in a variety of applications and has been studied extensively in the machine learning and statistics literatures [2, 3]. Yet, an effective solution approach remains elusive due to the problem's highly combinatorial and nonlinear nature. Common practice is to apply greedy stepwise heuristics to produce a well-fitting subset of regression variables [2]. These heuristics typically rely on model fitness metrics, such as Akaike's information criterion (AIC) and Mallows' Cp, to define a stopping point. We compare these stepwise heuristics, exhaustive search algorithms [3], and newly proposed direct optimization of integer programming formulations for several different model selection criteria. For this purpose, we use a large test set with problems from a variety of applications.
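As a concrete example of the greedy heuristics discussed above, the following is a minimal sketch of forward stepwise selection that stops when AIC no longer improves. The data, function names, and stopping rule are illustrative assumptions, not the specific algorithms evaluated in the paper.

```python
# Illustrative sketch of forward stepwise selection: greedily add the
# regressor that most improves AIC; stop when no addition helps.
# This is one of the greedy heuristics typically contrasted with
# exhaustive search and integer-programming formulations.
import numpy as np

def aic(y, y_hat, k):
    """AIC for a k-term least-squares fit (Gaussian errors assumed)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    n, p = X.shape
    selected, remaining = [], list(range(p))
    best_score = np.inf
    while remaining:
        # Score each candidate addition by the AIC of the enlarged model.
        scores = []
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            scores.append((aic(y, X[:, cols] @ beta, len(cols)), j))
        score, j = min(scores)
        if score >= best_score:
            break  # AIC no longer improves: stopping point reached
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

# Toy data: only columns 0 and 3 of X actually drive the response.
rng = np.random.default_rng(1)
X = rng.standard_normal((150, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.standard_normal(150)
selected = forward_stepwise(X, y)
print(sorted(selected))
```

Because the search is greedy, each pass fits only O(p) models rather than 2^p, but the resulting subset is not guaranteed to be the global optimum under the chosen criterion.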
References
[1] Cozad, A., N. V. Sahinidis, and D. C. Miller (2014). Automatic learning of algebraic models for optimization. AIChE Journal, 60, 2211-2227.
[2] Miller, A. J. (1990). Subset Selection in Regression. London: Chapman and Hall.
[3] Furnival, G. M. and R. W. Wilson (1974). Regression by leaps and bounds. Technometrics, 16, 499-511.