(415c) Regularized Subsets: A Framework for High-Dimensional Linear Regression with Noisy Data and Optional Constraints

Authors 

Sarwar, O. - Presenter, Carnegie Mellon University
Sahinidis, N., Carnegie Mellon University
Building accurate linear models from high-dimensional, noisy data is a challenging statistical task. Nevertheless, high-dimensional linear regression remains popular because it yields models that are easy to interpret and simpler to fit than alternative methods. Subset selection (SS) constructs sparse linear regression models by penalizing or constraining the number of variables included in the model: it selects a subset of candidate regression variables and computes their coefficients using ordinary least squares [1]. However, SS leads to models that fit the data poorly when the data are very noisy. Research shows that applying shrinkage to the coefficients of variables selected by l0-regression greatly enhances a model's robustness in these scenarios [2]. Consequently, we propose the flexible Regularized Subsets framework, which performs variable selection and coefficient shrinkage in two stages: first, any subset selection method is used to select variables; then, a convex penalty is used to shrink their coefficients.

Our experiments show that Regularized Subsets outperforms other popular linear model-building approaches from the literature across a variety of problem settings, including other approaches that combine a convex regularization penalty with subset selection. In addition, Regularized Subsets can build models from candidate sets of thousands of variables in seconds. We further demonstrate how the framework extends naturally to linear equality and inequality constraints, allowing users to incorporate prior knowledge into the model-building process. We show that our approach yields significantly sparser models than the most popular current method for constrained linear regression, the Constrained Lasso [3], with similar predictive performance. Finally, we provide our model-building framework as an open-source tool.
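The abstract leaves the implementation details open, so the following is a minimal illustrative sketch of the two-stage idea, not the authors' tool: scikit-learn's greedy forward stepwise selection stands in for the stage-one subset selection (any selection method could be substituted), and cross-validated ridge regression stands in for the stage-two convex shrinkage. All data, names, and parameters here are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression, RidgeCV

# Synthetic high-dimensional problem with a sparse ground truth.
rng = np.random.default_rng(0)
n, p, k = 200, 100, 5                        # samples, candidate variables, subset size
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = rng.uniform(1.0, 3.0, size=k)     # only the first k variables matter
y = X @ beta + 0.5 * rng.standard_normal(n)  # noisy response

# Stage 1: subset selection. Greedy forward stepwise selection is an
# arbitrary stand-in; the framework accepts any subset selection method.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=k, direction="forward"
).fit(X, y)
support = selector.get_support()

# Stage 2: convex shrinkage, restricted to the selected columns.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X[:, support], y)
print("selected variables:", np.flatnonzero(support))
print("shrunken coefficients:", ridge.coef_)
```

The constrained extension might likewise be posed as a small convex program in the second stage. The constraints below (nonnegative coefficients that sum to one) are placeholders for the kind of prior knowledge a user could supply, not constraints taken from the paper, and cvxpy stands in for whatever solver the actual framework uses.

```python
import cvxpy as cp

# Stage two with optional linear equality/inequality constraints,
# reusing X, y, and support from the sketch above.
Xs = X[:, support]
w = cp.Variable(Xs.shape[1])
lam = 1.0                                    # illustrative penalty weight
objective = cp.Minimize(cp.sum_squares(Xs @ w - y) + lam * cp.sum_squares(w))
constraints = [cp.sum(w) == 1, w >= 0]       # placeholder prior knowledge
cp.Problem(objective, constraints).solve()
print("constrained coefficients:", w.value)
```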

[1] A. Cozad, N. V. Sahinidis, D. C. Miller. Learning surrogate models for simulation-based optimization. AIChE Journal, 2014.
[2] O. Sarwar, B. Sauk, N. V. Sahinidis. A discussion on practical considerations with sparse regression methodologies. Statistical Science, 2020.
[3] B. R. Gaines, J. Kim, H. Zhou. Algorithms for fitting the constrained lasso. Journal of Computational and Graphical Statistics, 2018.