(415c) Regularized Subsets: A Framework for High-Dimensional Linear Regression with Noisy Data and Optional Constraints
AIChE Annual Meeting
2021 Annual Meeting
Computing and Systems Technology Division
Data-Driven and Hybrid Modeling for Decision Making I
Wednesday, November 10, 2021 - 8:30am to 8:45am
Building accurate linear models from high-dimensional, noisy data is a challenging statistical task. Still, high-dimensional linear regression remains popular because it yields models that are easy to interpret and simpler to fit than alternative methods. Subset selection (SS) is a methodology for constructing sparse linear regression models by either penalizing or constraining the number of variables selected in the model: it chooses a subset of the candidate regression variables to include in the model and computes their coefficients using ordinary least squares [1]. However, SS leads to models that fit the data poorly when the data contain a large amount of noise. Research shows that applying shrinkage to the coefficients of variables selected by ℓ0-regression greatly enhances the model's robustness in these settings [2]. Consequently, we propose the flexible Regularized Subsets framework, which performs variable selection and coefficient shrinkage in two stages: first, any subset selection method is used to select variables; then, a convex penalty is used to shrink their coefficients.

Our experiments show that Regularized Subsets outperforms other popular linear model-building approaches from the literature across a variety of problem settings, including other approaches that combine a convex regularization penalty with subset selection. In addition, Regularized Subsets can build models from candidate sets of thousands of variables in seconds. We further demonstrate how the framework extends naturally to handle linear equality and inequality constraints, allowing users to incorporate prior knowledge into the model-building process. We show that our approach yields significantly sparser models than the Constrained Lasso [3], currently the most popular method for constrained linear regression, with similar predictive performance. Finally, we provide our model-building framework as an open-source tool.
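To make the two-stage idea concrete, the following is a minimal sketch in Python with NumPy and scikit-learn. The selection stage below is a naive exhaustive search used purely for illustration (the framework accepts any subset selection method and, unlike this sketch, scales to thousands of candidate variables); ridge regression stands in for the convex shrinkage penalty, and the function names are hypothetical rather than the interface of the open-source tool.

import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression, Ridge

def best_subset(X, y, k):
    # Stage 1: choose the k variables whose OLS fit minimizes the
    # residual sum of squares. Exhaustive search is exponential and is
    # only an illustrative stand-in for a scalable selection method.
    best_rss, best_cols = np.inf, None
    for cols in combinations(range(X.shape[1]), k):
        cols = list(cols)
        ols = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - ols.predict(X[:, cols])) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols

def regularized_subsets(X, y, k, alpha=1.0):
    # Stage 2: refit the selected variables with a convex (here, ridge)
    # penalty, shrinking the coefficients instead of keeping the OLS fit.
    cols = best_subset(X, y, k)
    model = Ridge(alpha=alpha).fit(X[:, cols], y)
    return cols, model

# Example: noisy data with 3 relevant variables out of 10 candidates.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(100)
cols, model = regularized_subsets(X, y, k=3, alpha=1.0)
print(cols, model.coef_)

Extending this sketch to the constrained setting discussed above would amount to replacing the ridge fit in stage 2 with a quadratic program over the selected coefficients, subject to the linear equality and inequality constraints.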
[1] A Cozad, NV Sahinidis, DC Miller. Learning surrogate models for simulation-based optimization. AIChE Journal, 2014.
[2] O Sarwar, B Sauk, NV Sahinidis. A discussion on practical considerations with sparse regression methodologies. Statistical Science, 2020.
[3] BR Gaines, J Kim, H Zhou. Algorithms for fitting the constrained lasso. Journal of Computational and Graphical Statistics, 2018.