(415c) Regularized Subsets: A Framework for High-Dimensional Linear Regression with Noisy Data and Optional Constraints
AIChE Annual Meeting
2021 Annual Meeting
Computing and Systems Technology Division
Data-Driven and Hybrid Modeling for Decision Making I
Wednesday, November 10, 2021 - 8:30am to 8:45am
Building accurate linear models from high-dimensional, noisy data is a challenging statistical task. Still, high-dimensional linear regression remains popular because it yields models that are easy to interpret and simpler to fit than alternative methods. Subset selection (SS) is a methodology for constructing sparse linear regression models by either penalizing or constraining the number of variables selected in the model: it chooses a subset of the candidate regression variables to include in the model and computes their coefficients using ordinary least squares [1]. However, SS leads to models that fit the data poorly when the data contain a large amount of noise. Research shows that applying shrinkage to the coefficients of variables selected by ℓ0-regression greatly enhances the model's robustness in these settings [2]. Consequently, we propose the flexible Regularized Subsets framework, which performs variable selection and coefficient shrinkage in two stages: first, any subset selection method is used to select variables; then, a convex penalty is used to shrink their coefficients.

Our experiments show that Regularized Subsets outperforms other popular linear model-building approaches from the literature across a variety of problem settings, including other approaches that combine a convex regularization penalty with subset selection. In addition, Regularized Subsets can build models from candidate sets of thousands of variables in seconds. We further demonstrate how the framework extends naturally to handle linear equality and inequality constraints, allowing users to incorporate prior knowledge into the model-building process. We show that our approach yields significantly sparser models than the Constrained Lasso [3], currently the most popular method for constrained linear regression, with similar predictive performance. Finally, we provide our model-building framework as an open-source tool.
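To make the two-stage idea concrete, the following is a minimal sketch in Python with NumPy and scikit-learn. The selection stage below is a naive exhaustive search used purely for illustration (the framework accepts any subset selection method and, unlike this sketch, scales to thousands of candidate variables); ridge regression stands in for the convex shrinkage penalty, and the function names are hypothetical rather than the interface of the open-source tool.

import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression, Ridge

def best_subset(X, y, k):
    # Stage 1: choose the k variables whose OLS fit minimizes the
    # residual sum of squares. Exhaustive search is exponential and is
    # only an illustrative stand-in for a scalable selection method.
    best_rss, best_cols = np.inf, None
    for cols in combinations(range(X.shape[1]), k):
        cols = list(cols)
        ols = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - ols.predict(X[:, cols])) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols

def regularized_subsets(X, y, k, alpha=1.0):
    # Stage 2: refit the selected variables with a convex (here, ridge)
    # penalty, shrinking the coefficients instead of keeping the OLS fit.
    cols = best_subset(X, y, k)
    model = Ridge(alpha=alpha).fit(X[:, cols], y)
    return cols, model

# Example: noisy data with 3 relevant variables out of 10 candidates.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(100)
cols, model = regularized_subsets(X, y, k=3, alpha=1.0)
print(cols, model.coef_)

Extending this sketch to the constrained setting discussed above would amount to replacing the ridge fit in stage 2 with a quadratic program over the selected coefficients, subject to the linear equality and inequality constraints.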
[1] A Cozad, NV Sahinidis, DC Miller. Learning surrogate models for simulation-based optimization. AIChE Journal, 2014.
[2] O Sarwar, B Sauk, NV Sahinidis. A discussion on practical considerations with sparse regression methodologies. Statistical Science, 2020.
[3] BR Gaines, J Kim, H Zhou. Algorithms for fitting the constrained lasso. Journal of Computational and Graphical Statistics, 2018.