(371z) Extrapolation Error Quantification for the Discovery of Optimal Experimental Conditions

Authors 

Kim, J., Incheon National University
Jeong, W., Sungkyunkwan University
IM, H., Sungkyunkwan University
Lee, Y. S., Sungkyunkwan University
Lee, J., Sungkyunkwan University
Numerous researchers seek a powerful tool for identifying high-performance experimental conditions while reducing the time and cost of experimentation. To achieve this goal, researchers have traditionally used response surface methodology (RSM), a statistical optimization method that examines the relationship between the input factors and the output objective function. The objective of RSM is to optimize the input factors so that the objective function is maximized [1, 2]. RSM comprises a design of experiments (DoE) step, a model formulation step, a model evaluation step, and an optimization step to characterize system performance. DoE aims to obtain the maximum amount of information from a small number of experiments [3]. However, the regression model fitted in the formulation step only considers a linear relationship between the feature values and the target values, so it cannot capture nonlinearity in that relationship. Artificial intelligence (AI) has recently emerged as a substitute that addresses this limitation of RSM [4]. Several researchers have used AI tools such as Bayesian optimization and machine learning (ML) models to discover optimal experimental points.
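For reference, the model fitted in the RSM formulation step is typically a low-order polynomial such as the first-order (linear) response surface below; the notation is generic and not taken from the cited works:

    y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \varepsilon

Because the response y is expressed as a fixed linear combination of the factors x_i, such a model cannot represent more complex nonlinear dependencies between the factors and the response.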

However, if the test dataset lies outside the range of the training dataset, extrapolation occurs. Since ML models are designed to identify patterns from data within a specific range, extrapolation is challenging because the models may not generalize well to data outside that range. An AI model only considers the training dataset and therefore suffers from poor extrapolation accuracy. Bayesian optimization (BO) uses Gaussian process regression (GPR) as its surrogate model to predict values and express uncertainty. However, BO has a trustworthiness problem because the GPR model's accuracy is limited by its sampling-based training. Extreme gradient boosting (XGBoost), a tree-based ML model, is weak at extrapolation because of the nature of the decision tree algorithm. A tree-based algorithm partitions the feature space by searching for optimal splits, and in regression each resulting region is assigned the average of the training targets it contains. Since the outer region without data cannot be partitioned further, the predicted values there remain constant regardless of changes in the hyperparameters. This problem affects not only the accuracy but also the trustworthiness of extrapolation.
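The constant-prediction behavior of tree-based models outside the training range can be reproduced with a minimal sketch (the synthetic data and parameter values are assumptions for illustration, not the authors' code):

    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    # Synthetic 1-D example: train only on inputs between 600 and 800
    X_train = rng.uniform(600.0, 800.0, size=(200, 1))
    y_train = 0.05 * X_train[:, 0] + rng.normal(0.0, 1.0, 200)

    model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Outside the training range, the tree ensemble returns an (almost) constant value
    X_out = np.array([[850.0], [900.0], [1000.0]])
    print(model.predict(X_out))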

To solve the trustworthiness problem, we used a multilayer perceptron (MLP), one of the artificial neural network (ANN) algorithms. An MLP with rectified linear unit (ReLU) activation functions can extrapolate well when the training distribution covers a sufficiently broad range of the input space. In this study, we quantified the extrapolation error in order to recommend experimental points. As a case study, we used catalytic experiment data for the oxidative coupling of methane (OCM), which include the experimental conditions and results for each catalyst. The OCM reaction directly converts methane, reacted with oxygen, into ethane and ethylene (C2 hydrocarbons). The OCM catalytic reaction dataset consists of catalyst information, catalyst experimental conditions, and catalyst experimental results. Taking the Mn-Na2WO4/SiO2 catalyst as an example, we developed an MLP model that predicts the C2 hydrocarbon yield from the reaction temperature, the oxygen-to-methane feed ratio, and the contact time as input variables.
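In contrast to the tree-based sketch above, a ReLU MLP extends its learned piecewise-linear trend beyond the training range. The following minimal Keras sketch (synthetic data and architecture are assumptions, not the model reported here) illustrates this behavior:

    import numpy as np
    from tensorflow import keras

    rng = np.random.default_rng(0)
    X_train = rng.uniform(600.0, 800.0, size=(200, 1))
    y_train = 0.05 * X_train[:, 0] + rng.normal(0.0, 1.0, 200)

    # z-score the input using training statistics, as in the standardization step below
    mu, sigma = X_train.mean(), X_train.std()

    mlp = keras.Sequential([
        keras.Input(shape=(1,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1),
    ])
    mlp.compile(optimizer="adam", loss="mse")
    mlp.fit((X_train - mu) / sigma, y_train, epochs=300, batch_size=32, verbose=0)

    # Unlike the tree model, the predictions keep changing as the input moves
    # further outside the training range (piecewise-linear extrapolation)
    X_out = np.array([[850.0], [900.0], [1000.0]])
    print(mlp.predict((X_out - mu) / sigma, verbose=0))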

We developed an interpolation model for mapping the OCM catalytic reaction data. Before interpolation, the OCM data were standardized by z-score to prevent overfitting. The standardized dataset was then randomly split into a training set and a test set with a ratio of 8:2. This study employed XGBoost as the interpolation model because of its high accuracy and low computational time. The training set was divided by 10-fold cross-validation to evaluate the prediction accuracy and to optimize the hyperparameters (e.g., n_estimators, max_depth, learning_rate) with the Optuna library in Python. The prediction accuracy was evaluated with the R2 score and the root mean squared error (RMSE). Using the interpolation model, the catalytic reaction performance (C2 yield) was interpolated over different reaction conditions. As a result, 9,261 augmented data points were generated by the XGBoost-based interpolation model.
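A hedged sketch of this interpolation step is shown below: XGBoost is tuned with Optuna under 10-fold cross-validation and then evaluated on a dense grid of reaction conditions. The variable ranges, the number of trials, and the 21 levels per condition (21^3 = 9,261 grid points) are illustrative assumptions rather than the settings used in the study:

    import numpy as np
    import optuna
    from sklearn.model_selection import KFold, cross_val_score
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    # Synthetic stand-in for the 119 real experiments (temperature, O2/CH4 ratio, contact time)
    X = rng.uniform([600.0, 0.2, 0.1], [900.0, 1.0, 1.0], size=(119, 3))
    y = 0.02 * X[:, 0] + 5.0 * X[:, 1] - 3.0 * X[:, 2] + rng.normal(0.0, 0.5, 119)

    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 2, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        }
        scores = cross_val_score(XGBRegressor(**params), X, y,
                                 cv=KFold(n_splits=10, shuffle=True, random_state=0),
                                 scoring="neg_root_mean_squared_error")
        return -scores.mean()  # mean RMSE over the 10 folds

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=50)

    best = XGBRegressor(**study.best_params).fit(X, y)

    # 21 levels per reaction condition -> 21**3 = 9,261 interpolated points
    grid = np.stack(np.meshgrid(np.linspace(600.0, 900.0, 21),
                                np.linspace(0.2, 1.0, 21),
                                np.linspace(0.1, 1.0, 21)), axis=-1).reshape(-1, 3)
    augmented_yield = best.predict(grid)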

We also developed an MLP-based extrapolation model to quantify the extrapolation error. The MLP was built with the Keras library in Python. The 9,380 data points, comprising the 9,261 augmented data points and the 119 real data points, were divided into four parts based on the interquartile range (IQR) of each reaction condition. The MLP model was then trained on one or more parts of each reaction condition and used to predict the remaining parts. The extrapolation error is defined as the difference between the real value and the value predicted by the extrapolation model; it was calculated for each training scenario and quantified in terms of the distance from the population mean and the number of extrapolated variables.
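The following sketch shows one such training scenario under assumed data and settings (the quartile split via pandas, the network size, and the synthetic yields are illustrative, not the study's configuration): the data are split into four quartile-based parts per condition, the MLP is trained on the lower three temperature parts, and the error on the held-out upper part is related to the distance from the population mean:

    import numpy as np
    import pandas as pd
    from tensorflow import keras

    rng = np.random.default_rng(1)
    # Hypothetical stand-in for the 9,380-point OCM table
    df = pd.DataFrame({
        "temperature": rng.uniform(600.0, 900.0, 2000),
        "o2_ch4_ratio": rng.uniform(0.2, 1.0, 2000),
        "contact_time": rng.uniform(0.1, 1.0, 2000),
    })
    df["c2_yield"] = (0.02 * df["temperature"] + 5.0 * df["o2_ch4_ratio"]
                      - 3.0 * df["contact_time"] + rng.normal(0.0, 0.5, 2000))

    features = ["temperature", "o2_ch4_ratio", "contact_time"]
    # Assign each point to one of four quartile-based parts per reaction condition
    for col in features:
        df[col + "_part"] = pd.qcut(df[col], q=4, labels=False)

    # Example scenario: train on temperature parts 0-2, extrapolate to part 3
    train = df[df["temperature_part"] <= 2]
    test = df[df["temperature_part"] == 3]

    mu, sigma = train[features].mean(), train[features].std()  # z-score from the training parts

    mlp = keras.Sequential([
        keras.Input(shape=(3,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),
    ])
    mlp.compile(optimizer="adam", loss="mse")
    mlp.fit(((train[features] - mu) / sigma).to_numpy(), train["c2_yield"].to_numpy(),
            epochs=100, batch_size=64, verbose=0)

    pred = mlp.predict(((test[features] - mu) / sigma).to_numpy(), verbose=0).ravel()
    error = np.abs(test["c2_yield"].to_numpy() - pred)                 # extrapolation error per point
    distance = np.abs(test["temperature"] - df["temperature"].mean())  # distance from the population mean
    print(np.corrcoef(error, distance.to_numpy())[0, 1])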

In conclusion, we proposed a framework for quantifying the extrapolation error, demonstrated on the Mn-Na2WO4/SiO2 catalyst for the OCM reaction. The XGBoost-based interpolation model interpolated the data for data augmentation. We then developed the MLP-based extrapolation model and quantified the extrapolation error using the augmented data. This extrapolation error quantification methodology could support the recommendation of additional experimental points that increase the accuracy of the extrapolation model. Accordingly, the improved accuracy would make the extrapolation model's predictions more trustworthy to researchers.

References

  1. Dong, Y., C. Georgakis, J. Mustakis, et al., “Constrained Version of the Dynamic Response Surface Methodology for Modeling Pharmaceutical Reactions,” Industrial and Engineering Chemistry Research, 58 (30), pp. 13611–13621 (2019).
  2. Schüler, C., F. Betzenbichler, C. Drescher, and O. Hinrichsen, “Optimization of the synthesis of Ni catalysts via chemical vapor deposition by response surface methodology,” Chemical Engineering Research and Design, 132, pp. 303–312 (2018).
  3. Elfghi, F.M., “A hybrid statistical approach for modeling and optimization of RON: A comparative study and combined application of response surface methodology (RSM) and artificial neural network (ANN) based on design of experiment (DOE),” Chemical Engineering Research and Design, 113, pp. 264–272 (2016).
  4. Mazheika, A., Y.G. Wang, R. Valero, et al., “Artificial-intelligence-driven discovery of catalyst genes with application to CO2 activation on semiconductor oxides,” Nature Communications, 13 (1), (2022).