(635f) Statistical Machine Learning for the DOW Data Challenge Problem | AIChE

Authors 

Qin, S. J. - Presenter, City University of Hong Kong
Guo, S., University of Southern California
Chiang, L., Dow Inc.
Castillo, I., The Dow Chemical Company
In this paper, we present a statistical machine learning approach to the DOW challenge dataset, which was obtained from a multi-column integrated process, as shown in Figure 1 (Braun et al., 2020). The process comprises three key distillation columns: the Primary Column, the Feed Column, and the Secondary Column. The main objective of the challenge is to identify, from more than 40 process variables, the key variables that affect the impurity level measured at the Primary Column outlet. A further task is to build a high-precision inferential sensor model to predict the impurity. To benchmark competing solutions, a validation dataset is provided in addition to the training dataset, which contains data collected over one year. The validation dataset must not be used in any way for modeling, including the selection of hyperparameters; it may be used only to demonstrate the accuracy of the inferential sensor model.
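Because the validation dataset is off limits during modeling, any hyperparameter tuning has to take place inside the training data. The following is a minimal sketch of one way to do that for time-ordered process data, using forward-chaining splits in which each tuning block lies strictly after the data used for fitting. The function name `forward_chaining_splits` and the fold layout are illustrative assumptions, not the procedure used in this work.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (fit_indices, tune_indices) pairs over the TRAINING data only.

    Hypothetical helper: the training record is cut into n_folds + 1
    contiguous blocks; fold k fits on the first k blocks and tunes on the
    next one, so no tuning sample ever precedes a fitting sample.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        fit_end = k * block
        # The last fold absorbs any leftover samples at the end of the record.
        tune_end = n_samples if k == n_folds else fit_end + block
        yield list(range(fit_end)), list(range(fit_end, tune_end))


if __name__ == "__main__":
    for fit_idx, tune_idx in forward_chaining_splits(10, 4):
        print(len(fit_idx), "fit samples ->", tune_idx)
```

A hyperparameter such as the lasso penalty would then be chosen by averaging the tuning error over these folds, and the separate validation dataset would be touched only once, to report the final accuracy.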

Our proposed solution is a statistical machine learning approach that consists of i) exploratory analysis of the process data, ii) a method for variable selection, iii) a method for modeling a non-negative physical property using a soft-plus function, and iv) a method for real-time bias updating based on known data. We benchmark three main algorithms: partial least squares (PLS), lasso, and least angle regression (LARS). Using the validation dataset, we demonstrate that our method gives superior prediction results. We discuss the pros and cons of the statistical machine learning methods, with practical implications for industrial users. We draw on domain knowledge, and emphasize its importance, in exploratory analysis and feature selection. We report two findings from the data: mode-switching operation, whose identification leads to proper data pre-processing, and interpolated values in the impurity data. We also provide a solution for modeling irregularly sampled quality data, which shows that it is unnecessary to interpolate the lab-test impurity data.
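To make items iii) and iv) concrete, the sketch below passes a linear model score through a soft-plus function so that the predicted impurity can never be negative, and corrects a running bias only at the time steps where a lab result exists. The abstract does not specify the update rule, so everything here is an assumption made for illustration: the names `inverse_softplus` and `run_inferential_sensor`, the exponential-smoothing update with a `gain` parameter, the choice to apply the bias before the soft-plus, and the simplification that a lab result is available at the same step as the prediction it corrects.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z)); the result is never negative."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def inverse_softplus(y):
    """Return z such that softplus(z) == y; requires y > 0."""
    if y <= 0.0:
        raise ValueError("softplus output must be positive")
    return y + math.log(-math.expm1(-y))


def run_inferential_sensor(linear_scores, lab_results, gain=0.5):
    """Predict a non-negative property with a bias updated from lab data.

    linear_scores: model outputs before the soft-plus, one per time step.
    lab_results:   measured values, or None where no lab test exists, so
                   irregular sampling needs no interpolation.
    gain:          assumed smoothing factor in (0, 1] for the bias update.
    """
    bias = 0.0
    predictions = []
    for z, lab in zip(linear_scores, lab_results):
        predictions.append(softplus(z + bias))
        if lab is not None and lab > 0.0:
            # Compare in the pre-activation domain so the corrected
            # prediction still passes through the soft-plus.
            error = inverse_softplus(lab) - z
            bias = (1.0 - gain) * bias + gain * error
    return predictions


if __name__ == "__main__":
    scores = [1.0, 1.0, 1.0, 1.0]
    labs = [None, 2.0, None, None]  # a single lab test at the second step
    print(run_inferential_sensor(scores, labs, gain=1.0))
```

With `gain=1.0` the bias jumps to the latest lab-implied offset, so the predictions after the lab test match the measured value when the score is unchanged; a smaller gain would trade responsiveness for robustness to noisy lab results.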

Figure 1. DOW Challenge process flowchart from which the datasets were collected.

References

  1. B. Braun, I. Castillo, M. Joswiak, Y. Peng, R. Rendall, A. Schmidt, Z. Wang, L. Chiang, and B. Colegrove, Data Science Challenges in Chemical Manufacturing, IFAC World Congress, July 2020, Berlin, Germany.