(714c) Comparison of Different Variable Selection Methods for PLS Soft Sensor Development
AIChE Annual Meeting
2013
Computing and Systems Technology Division
Big Data Applications in Chemical Engineering
Thursday, November 7, 2013 - 3:55pm to 4:15pm
Over the past two decades, rapid developments in technology have facilitated the collection of vast amounts of data from industrial processes. These data have been used in many areas, such as data-driven soft sensor development and process monitoring, to control and optimize processes. The performance of these data-driven schemes can be greatly improved by selecting only the vital variables that strongly affect the primary variables, rather than using all available process variables. Consequently, variable selection has become one of the most important practical concerns in data-driven approaches. By identifying irrelevant and redundant variables, variable selection can improve prediction performance, reduce computational load and model complexity, provide better insight into the nature of the process, and lower the cost of measurements [1], [2].
This work presents a comprehensive evaluation of variable selection methods for soft sensor development. Seven algorithms are investigated: stepwise regression (SR) [3], partial least squares (PLS) with regression coefficients (PLS-BETA), PLS with variable importance in projection (PLS-VIP) [4], uninformative variable elimination with PLS (UVE-PLS) [5], genetic algorithm with PLS (GA-PLS) [6–8], removing irrelevant variables amidst Lasso iterations (RIVAL) [9], and competitive adaptive reweighted sampling with PLS (CARS-PLS) [10].
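To make the PLS-VIP idea concrete, the following is a minimal sketch of how VIP scores can be computed from a single-response PLS model. It assumes the common textbook definition of VIP (squared normalized weights, weighted by the response variance each component explains) and a plain NIPALS fit; the function names `pls1_fit` and `vip_scores` and the toy data are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model by NIPALS on mean-centered data.

    Returns weights W (p x A), scores T (n x A), loadings P (p x A)
    and y-loadings q (A,).
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    W = np.zeros((p, n_components))
    T = np.zeros((n, n_components))
    P = np.zeros((p, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)          # unit-length weight vector
        t = X @ w                       # scores for this component
        tt = t @ t
        p_a = X.T @ t / tt              # X-loadings
        q[a] = y @ t / tt               # y-loading
        X = X - np.outer(t, p_a)        # deflate X
        y = y - q[a] * t                # deflate y
        W[:, a], T[:, a], P[:, a] = w, t, p_a
    return W, T, P, q

def vip_scores(W, T, q):
    """VIP_j = sqrt(p * sum_a SS_a * (w_ja / ||w_a||)^2 / sum_a SS_a)."""
    p = W.shape[0]
    ss = q**2 * np.sum(T**2, axis=0)    # y-variance explained per component
    w_norm = W / np.linalg.norm(W, axis=0)
    return np.sqrt(p * (w_norm**2 @ ss) / ss.sum())

# Toy data (assumed for illustration): only the first two variables drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.normal(size=200)
W, T, P, q = pls1_fit(X_demo, y_demo, n_components=2)
vip = vip_scores(W, T, q)
print(np.round(vip, 2))
```

By construction the squared VIP scores average to one, which is why a cutoff near 1 is a common starting point; the cutoff itself is a tuning parameter.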
The algorithms and their characteristics are presented. More importantly, their strengths and limitations when applied to soft sensor development are illustrated by two case studies. A simple simulation case is used to investigate the properties of the selected methods. The dataset is generated to mimic typical characteristics of process data, such as the magnitude of correlations between variables and the signal-to-noise ratio [4]. In addition, the algorithms are applied to an industrial soft sensor case study, the production of polyester resin [11].
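As an illustration of the kind of simulated dataset described above, the sketch below generates correlated predictors with a controllable correlation level and a response with a controllable signal-to-noise ratio. The design is an assumption made for illustration (equicorrelated variables built from one shared factor, unit coefficients on the relevant variables, SNR defined as a variance ratio); it is not the exact simulation used in this work or in [4], and `simulate_process_data` is a hypothetical name.

```python
import numpy as np

def simulate_process_data(n_samples=300, n_vars=20, n_relevant=4,
                          rho=0.7, snr=5.0, seed=0):
    """Generate correlated predictors X and a noisy response y.

    rho : assumed pairwise correlation between any two predictors
    snr : assumed ratio var(signal) / var(noise)
    """
    rng = np.random.default_rng(seed)
    # Each predictor mixes one shared factor with its own independent part,
    # giving unit variance and pairwise correlation rho.
    common = rng.normal(size=(n_samples, 1))
    unique = rng.normal(size=(n_samples, n_vars))
    X = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * unique
    # Only the first n_relevant variables affect the response.
    beta = np.zeros(n_vars)
    beta[:n_relevant] = 1.0
    signal = X @ beta
    noise_sd = np.sqrt(signal.var() / snr)
    y = signal + rng.normal(scale=noise_sd, size=n_samples)
    return X, y, beta

X_sim, y_sim, beta_true = simulate_process_data()
print(X_sim.shape, y_sim.shape, int(beta_true.sum()))
```

Varying `rho` and `snr` gives a simple way to probe how a selection method behaves as multicollinearity and noise increase.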
In previous work, the methods were compared using their literature-suggested input parameters without any tuning. The results indicated that PLS-VIP outperformed the other methods in terms of both prediction performance and effectiveness in identifying key variables. In this work, independent datasets are used to optimize each variable selection method and to analyze the sensitivity of each method to its tuning parameters. In both the simulated and industrial cases, each dataset is divided into training, validation, and testing sets. The training set is used to build a series of reduced models over the suggested range of tuning parameters for each variable selection algorithm, while the validation set is used to determine the best tuning parameters. Finally, the PLS soft sensor prediction performance on the independent testing sets is used to provide a fair comparison and analysis of the algorithms. The methods are compared on prediction performance, sensitivity, robustness, and computational load. Taking these criteria into consideration, it is concluded that PLS-VIP outperforms the alternative methods.
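The training, validation, and testing procedure described above can be sketched as follows for one method, PLS-VIP, with the VIP cutoff as its tuning parameter. This is a minimal sketch under stated assumptions: the compact NIPALS routine `pls1`, the helper names `rmse` and `tune_vip_threshold`, the cutoff grid from 0.5 to 1.5, the 50/25/25 split, and the synthetic data are all hypothetical choices for illustration rather than the settings used in this work.

```python
import numpy as np

def pls1(X, y, n_comp):
    """Minimal NIPALS PLS1; returns (coef, intercept, vip)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q, ss = [], [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        tt = t @ t
        p = Xr.T @ t / tt
        c = yr @ t / tt
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
        ss.append(c**2 * tt)            # y-variance explained by component
    W, P = np.array(W).T, np.array(P).T
    q, ss = np.array(q), np.array(ss)
    coef = W @ np.linalg.solve(P.T @ W, q)
    vip = np.sqrt(X.shape[1] * (W**2 @ ss) / ss.sum())
    return coef, y_mean - x_mean @ coef, vip

def rmse(X, y, coef, b0):
    return float(np.sqrt(np.mean((y - (X @ coef + b0)) ** 2)))

def tune_vip_threshold(X_tr, y_tr, X_val, y_val, thresholds, n_comp=2):
    """Pick the VIP cutoff whose reduced model has the lowest validation RMSE."""
    _, _, vip = pls1(X_tr, y_tr, n_comp)         # selection uses training data only
    best = None
    for thr in thresholds:
        keep = np.flatnonzero(vip >= thr)
        if keep.size == 0:
            continue
        coef, b0, _ = pls1(X_tr[:, keep], y_tr, min(n_comp, keep.size))
        err = rmse(X_val[:, keep], y_val, coef, b0)
        if best is None or err < best[0]:
            best = (err, float(thr), keep)
    return best

# Synthetic data (assumed): three relevant variables out of fifteen.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 15))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + 0.5 * rng.normal(size=400)
X_tr, X_val, X_te = X[:200], X[200:300], X[300:]
y_tr, y_val, y_te = y[:200], y[200:300], y[300:]

val_rmse, thr, keep = tune_vip_threshold(
    X_tr, y_tr, X_val, y_val, thresholds=np.arange(0.5, 1.55, 0.1))
# The testing set is touched only once, after the cutoff has been fixed.
coef, b0, _ = pls1(X_tr[:, keep], y_tr, min(2, keep.size))
test_rmse = rmse(X_te[:, keep], y_te, coef, b0)
print(f"threshold={thr:.1f}, kept={keep.tolist()}, test RMSE={test_rmse:.3f}")
```

Keeping the testing set out of both selection and tuning is what makes the final comparison between methods fair; recording the validation error at every cutoff, rather than only the best one, would also give a simple view of each method's sensitivity to its tuning parameter.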
References
[1] C. M. Andersen and R. Bro, “Variable selection in regression—a tutorial,” Journal of Chemometrics, vol. 24, no. 11-12, pp. 728–737, Nov. 2010.
[2] J. Reunanen, “Overfitting in making comparisons between variable selection methods,” The Journal of Machine Learning Research, vol. 3, pp. 1371–1382, 2003.
[3] M.-D. Ma, J.-W. Ko, S.-J. Wang, M.-F. Wu, S.-S. Jang, S.-S. Shieh, and D. S.-H. Wong, “Development of adaptive soft sensor based on statistical identification of key variables,” Control Engineering Practice, vol. 17, no. 9, pp. 1026–1034, Sep. 2009.
[4] I.-G. Chong and C.-H. Jun, “Performance of some variable selection methods when multicollinearity is present,” Chemometrics and Intelligent Laboratory Systems, vol. 78, no. 1–2, pp. 103–112, Jul. 2005.
[5] V. Centner, D. L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, and C. Sterna, “Elimination of uninformative variables for multivariate calibration,” Analytical Chemistry, vol. 68, no. 21, pp. 3851–3858, Nov. 1996.
[6] R. Leardi and A. Lupiáñez González, “Genetic algorithms applied to feature selection in PLS regression: how and when to use them,” Chemometrics and Intelligent Laboratory Systems, vol. 41, no. 2, pp. 195–207, Jul. 1998.
[7] D. Whitley, “A genetic algorithm tutorial,” Statistics and Computing, vol. 4, no. 2, Jun. 1994.
[8] D. Broadhurst, J. J. Rowland, and D. B. Kell, “Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry,” Analytica Chimica Acta, vol. 348, pp. 71–86, 1997.
[9] P. Kump, E.-W. Bai, K. Chan, B. Eichinger, and K. Li, “Variable selection via RIVAL (removing irrelevant variables amidst Lasso iterations) and its application to nuclear material detection,” Automatica, vol. 48, no. 9, pp. 2107–2115, Sep. 2012.
[10] H. Li, Y. Liang, Q. Xu, and D. Cao, “Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration,” Analytica Chimica Acta, vol. 648, no. 1, pp. 77–84, Aug. 2009.
[11] P. Facco, F. Doplicher, F. Bezzo, and M. Barolo, “Moving average PLS soft sensor for online product quality estimation in an industrial batch polymerization process,” Journal of Process Control, vol. 19, no. 3, pp. 520–529, Mar. 2009.