(393a) A Metaheuristic Approach to Best Subset Selection for the Development of Regression-Based Surrogate Models

Authors 

Sarwar, O. - Presenter, Carnegie Mellon University
Sahinidis, N., Carnegie Mellon University
Learning algebraic models from data is an increasingly popular approach for modeling complex chemical engineering systems. This trend is driven by the difficulty of constructing and solving physics-based models, by increases in available system and experimental data, and by improved strategies for regression. In this work, we focus on predicting the value of a single output variable using a linear combination of features, where the features comprise the regressor variables themselves as well as nonlinear transformations of the regressors.
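
For concreteness, such a surrogate model is linear in its parameters but nonlinear in the regressors; the specific basis functions shown below (powers, square roots, logarithms, pairwise products) are illustrative of typical transformations rather than the exact feature set used in this work:

```latex
\hat{y}(\mathbf{x}) \;=\; \beta_0 \;+\; \sum_{j=1}^{p} \beta_j\, \phi_j(\mathbf{x}),
\qquad
\phi_j(\mathbf{x}) \in \left\{\, x_i,\; x_i^2,\; \sqrt{x_i},\; \log x_i,\; x_i x_k,\; \dots \right\}
```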

Well-known strategies for linear regression include Forward/Backward Selection, Ordinary Least Squares, and Best Subset Selection (BSS). BSS selects coefficient values for the features so as to minimize the least-squares error between the actual and predicted values of the output, while also penalizing the number of features included in the model. This task can be posed as a mixed-integer optimization (MIO) problem and leads to simple models with good accuracy. However, as the number of data points and the number of potential features grow, the resulting combinatorial explosion makes the MIO problem intractable to solve in a reasonable amount of time.
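
As a sketch, a standard MIO statement of the penalized BSS problem introduces a binary variable z_j that switches feature j in or out of the model and links it to the coefficient beta_j through a big-M constraint; the penalty weight lambda and bound M below are generic placeholders, not the exact formulation used in this work:

```latex
\min_{\boldsymbol{\beta},\, \mathbf{z}} \;\;
\left\lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \right\rVert_2^2
\;+\; \lambda \sum_{j=1}^{p} z_j
\quad \text{s.t.} \quad
-M z_j \;\le\; \beta_j \;\le\; M z_j,
\qquad z_j \in \{0,1\}, \quad j = 1,\dots,p
```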

In this paper, we present a metaheuristic-based algorithm for solving the Best Subset Selection problem. Our strategy combines a stochastic global search with a deterministic local search. We begin by defining neighborhoods in the combinatorial space whose sizes are based on the Hamming distance from the current point. Within such a neighborhood, we randomly select a new point in the combinatorial space and optimize the continuous coefficients. We then perform a local search in the neighborhood of the current point: for small neighborhoods, the local search exhaustively enumerates combinatorial solutions and applies well-known updates to the coefficient vector; for large neighborhoods, it calls a commercial solver. If a better solution is found, the local search is re-centered at the new best point. If no improving solution is found, the neighborhood size is increased until the maximum neighborhood size is reached, at which point the algorithm returns to the global search phase. We demonstrate our approach on a wide range of test data and show that it yields accurate models in a tractable manner.
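
The Python sketch below illustrates the structure of such a two-phase search under simplifying assumptions; it is not the authors' implementation. In particular, the use of ordinary least squares (numpy.linalg.lstsq) as the coefficient update, the per-feature penalty weight lam, the neighborhood schedule, and the restriction of the local search to Hamming-distance-1 flips (in place of a commercial solver for large neighborhoods) are all assumptions made for brevity.

```python
import numpy as np

def bss_objective(X, y, support, lam):
    """Simplified BSS objective: least-squares error on the selected
    features plus a penalty of lam per included feature."""
    idx = np.flatnonzero(support)
    beta = np.zeros(X.shape[1])
    if idx.size == 0:
        return float(y @ y), beta
    beta_s, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
    beta[idx] = beta_s
    resid = y - X[:, idx] @ beta_s
    return float(resid @ resid) + lam * idx.size, beta

def local_search(X, y, support, lam):
    """Deterministic local search: exhaustively try all Hamming-distance-1
    moves (flip one feature in or out), refit, and re-center whenever an
    improving solution is found."""
    best_obj, best_beta = bss_objective(X, y, support, lam)
    best_support = support.copy()
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            cand = best_support.copy()
            cand[j] = ~cand[j]
            obj, beta = bss_objective(X, y, cand, lam)
            if obj < best_obj - 1e-10:
                best_obj, best_beta, best_support = obj, beta, cand
                improved = True  # re-center at the improved point
    return best_support, best_beta, best_obj

def metaheuristic_bss(X, y, lam=1.0, max_hamming=5, max_iters=50, seed=0):
    """Two-phase search: a stochastic perturbation of the incumbent within a
    Hamming-distance-k neighborhood (global phase), followed by a
    deterministic local search; k grows when no improvement is found and
    resets once the maximum neighborhood size is reached."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    support = rng.random(p) < 0.5                 # random starting subset
    support, beta, obj = local_search(X, y, support, lam)
    k = 1
    for _ in range(max_iters):
        # Global phase: flip k randomly chosen features of the incumbent.
        cand = support.copy()
        flips = rng.choice(p, size=min(k, p), replace=False)
        cand[flips] = ~cand[flips]
        # Local phase: polish the perturbed point.
        cand, cand_beta, cand_obj = local_search(X, y, cand, lam)
        if cand_obj < obj:
            support, beta, obj = cand, cand_beta, cand_obj
            k = 1                                 # re-center, reset neighborhood
        else:
            k = k + 1 if k < max_hamming else 1   # widen, then restart schedule
    return support, beta, obj
```

In this sketch, the alternation between random perturbations of growing Hamming radius and exhaustive one-flip refinement plays the role of the global and local phases described above; a full implementation would replace the one-flip enumeration with larger-neighborhood enumeration or a commercial solver.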