(33c) An Open-Source Tool for Implementing and Comparing Sparse Regression Methods | AIChE

Authors 

Sarwar, O. - Presenter, Carnegie Mellon University
Sahinidis, N. - Presenter, Carnegie Mellon University
Hubbs, C. D., The Dow Chemical Company
Data-derived regression models, which predict the value of an output variable (the response) from a set of input variables (the regressors), have become popular across disciplines in science and engineering. Linear regression surrogates are powerful substitutes for physics-based models because they represent complex processes with simple equations that can be derived quickly. Sparse regression is a model-building paradigm that assumes the response can be predicted by only a few of the many candidate regressors, and penalizes overly complex models. Sparse linear regression is especially useful in engineering because the measured variables must often be nonlinearly transformed into a much larger set of regressors to capture process complexity, leading to high-dimensional problems. Many algorithms exist for sparse regression, including subset-selection methods, Lasso-based methods, and nonconvex penalties such as MCP and SCAD. Unfortunately, practitioners have little guidance for choosing among these methods, and regression in practice is usually trial and error.
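To make the setting concrete, here is a minimal sketch of sparse linear regression after nonlinear feature expansion, written with scikit-learn rather than the package described in this work: a response that truly depends on only two terms is fit over a degree-3 polynomial expansion of five measured variables, and the L1 penalty zeroes out most of the candidate regressors.

```python
# Illustration (scikit-learn, not the authors' package): sparse linear
# regression over a nonlinearly expanded regressor set. The true model
# uses only 2 of the 55 candidate regressors.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 5))  # 5 measured input variables
y = 3.0 * X[:, 0] ** 2 - 2.0 * X[:, 1] * X[:, 2] \
    + 0.05 * rng.standard_normal(200)

# Expand the 5 measured variables into all degree<=3 monomials,
# producing a much larger, high-dimensional regressor set.
poly = PolynomialFeatures(degree=3, include_bias=False)
Z = poly.fit_transform(X)  # 55 candidate regressors

# The L1 penalty drives most coefficients to exactly zero.
model = Lasso(alpha=0.01).fit(Z, y)
selected = np.flatnonzero(model.coef_)
print(f"{Z.shape[1]} candidate regressors, {selected.size} selected")
```

The penalty weight `alpha` is an illustrative choice; in practice it would be tuned, e.g. by cross-validation.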

In this work, we systematically study various types of data and problem settings to help users pick a sparse regression method for practical application, connecting empirical results to theory wherever possible. We first build on previous work that assumes the underlying model is linear, and compare sparse linear regression methods on their ability to recover the true feature set without selecting many irrelevant features, as well as on their predictive accuracy. We then turn to the case where the underlying model is not assumed to be linear, as in many engineering applications: after defining performance metrics, we examine a range of problem settings to determine which sparse regression method performs best. These experiments use both synthetic and real data.
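A sketch of the kind of synthetic-data experiment described above, again using scikit-learn for illustration: data are generated from a known sparse linear model, a method is fit, and its estimated support is scored for recovery of the true features and for irrelevant selections. The metric names (`recall`, `fdr`) are illustrative choices, not necessarily the metrics defined in this work.

```python
# Illustrative support-recovery experiment on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, k = 100, 50, 5  # samples, regressors, true nonzeros
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0  # true support: the first k features
y = X @ beta + 0.1 * rng.standard_normal(n)

def support_metrics(coef, true_support, tol=1e-8):
    """Score an estimated coefficient vector against the known support."""
    est = set(np.flatnonzero(np.abs(coef) > tol))
    tp = len(est & true_support)
    recall = tp / len(true_support)  # fraction of true features recovered
    fdr = 0.0 if not est else (len(est) - tp) / len(est)  # false discoveries
    return recall, fdr

true_support = set(range(k))
lasso = Lasso(alpha=0.1).fit(X, y)
recall, fdr = support_metrics(lasso.coef_, true_support)
print(f"recall={recall:.2f}, FDR={fdr:.2f}")
```

Repeating this loop over methods, noise levels, and regressor counts gives exactly the style of comparison table the study describes.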

Finally, we want users to be able to compare regression methods for their own application in a single step. To that end, we release a framework for regression comparison and general model building as an open-source Python package that aggregates numerous popular regression methods (sparse linear regression algorithms and others), feature-engineering methods, and dimensionality-reduction techniques behind a single, easy-to-use interface. The package is intended to let users, in one step: (1) take their own data and easily build, compare, and select models from different methods; (2) investigate the relative performance of various methods on synthetic data; and (3) benchmark their own novel regression methods against those in the package.
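Since the package's API is not specified in this abstract, the following is only a generic sketch, in plain scikit-learn, of what a one-step comparison across methods through a common interface can look like: several regressors are scored by cross-validation on the same data and the best performer is selected.

```python
# Generic one-step comparison across methods (scikit-learn, not the
# authors' package): fit several regressors through a common interface
# and rank them by cross-validated R^2.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((150, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(150)

methods = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.05),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in methods.items()}
best = max(scores, key=scores.get)
print(f"best method: {best} (mean CV R^2 = {scores[best]:.3f})")
```

A unified interface like the one described in this work would additionally fold in feature engineering and dimensionality reduction before the comparison step.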