(135b) Sample Size Determination for Metamodel Building in Automated Machine Learning Pipelines By an Inclusive Feedback Algorithm | AIChE

Authors 

Bispo, H. - Presenter, Federal University of Campina Grande
Brandão, A. C. - Presenter, Federal University of Campina Grande
Fernandes, T. C. R., Federal University of Campina Grande
This contribution presents a framework that systematically determines the sample size required to build accurate metamodels. A feedback control strategy drives an Automated Machine Learning pipeline designed to produce simple regression models whenever possible and to fall back on more complex, less desirable nonlinear models only when necessary. At each iteration of the proposed algorithm, the feedback scheme computes the number of samples to place in the input space using sequential design of experiments techniques, which query the actual process on demand. The goal is to minimize the number of calls to the data-generating system by processing all responses together within the same iteration step. This is achieved by an inclusive strategy in which each generated dataset is exploited in full to drive even difficult responses toward convergence, until a stopping criterion on the error score is met. The contribution introduces three new features, which can be viewed as a paradigm shift in the field: (i) a proportional controller that drives the error score toward zero by determining the appropriate number of samples, (ii) a steady-state detection routine that halts metamodel training when no further progress is made, and (iii) an inclusive strategy that handles multiple responses simultaneously. The first two are borrowed from Process Control Engineering, which reflects the interdisciplinary nature of the algorithm. Applied to representative test cases, the framework proved effective at building metamodels with significant predictive capabilities using an appropriate number of samples.
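Features (i) and (ii) can be illustrated with a minimal Python sketch. The abstract does not give the control law or the steady-state test, so everything here is an assumption made for illustration: the function names `samples_to_add` and `is_steady`, the gain, the sample bounds, the window length, and the tolerance are all hypothetical.

```python
import math


def samples_to_add(error, gain=50.0, n_min=1, n_max=200):
    """Proportional law (assumed form): samples = gain * (error - setpoint).

    The setpoint is zero, so the request shrinks as the error score falls.
    The result is rounded up and clamped to [n_min, n_max].
    """
    if error < 0:
        raise ValueError("error score must be non-negative")
    requested = math.ceil(gain * error)
    return max(n_min, min(n_max, requested))


def is_steady(history, window=3, tol=1e-3):
    """Return True when the last `window` error scores span less than `tol`.

    This is one simple steady-state test (assumed, not the authors' routine):
    training stops because adding samples no longer changes the error.
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol


# A large error requests many samples, a small error only a few.
print(samples_to_add(0.5), samples_to_add(0.02))
# The error has stalled around 0.2, so training would be halted.
print(is_steady([0.9, 0.4, 0.2, 0.2001, 0.2002]))
```

Clamping the controller output keeps a single bad error score from requesting an unbounded number of process calls, which matters when each call is expensive.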

Automated Machine Learning (AutoML) aims to compile all the decisions needed for model building in a data-driven, objective, and automated way, making the practice more efficient for non-experts (Thornton et al., 2013). Many popular AutoML tools are freely available to this audience, including Auto-WEKA (Kotthoff et al., 2016, 2019; Thornton et al., 2013), AutoGluon-Tabular (Erickson et al., 2020), Auto-sklearn (Feurer et al., 2015), H2O AutoML (LeDell & Poirier, 2020), SUMO (Gorissen et al., 2010), Hyperopt-sklearn (Komer et al., 2014), TPOT (Olson et al., 2016), and Auto-Keras (Jin et al., 2019). A high-level overview of the proposed heuristic inclusive feedback iterative AutoML pipeline (HIFIAPP) for regression of process data is shown in Algorithm 1. The ideas described above are embedded in this pipeline, which is designed to systematically produce simply structured metamodels using a minimum number of samples in a reasonable amount of time. Applied to representative test cases, the algorithm proved effective at building surrogate models with significant predictive capabilities.
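The sketch below illustrates the inclusive, simple-model-first loop on a toy problem. It is not Algorithm 1, which is not reproduced in this abstract. The two-response `process`, the van der Corput sampling sequence, the piecewise-linear fallback model, the gain, and the tolerance are all hypothetical choices. The error score is evaluated on a dense grid of the toy function only for brevity; a real pipeline would presumably use a validation scheme that does not add process calls.

```python
import math


def process(x):
    """Hypothetical data-generating system: one call returns every response."""
    return {"y_lin": 2.0 * x + 1.0, "y_nonlin": math.sin(3.0 * x)}


def fit_line(xs, ys):
    """Simple model: least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_interp(xs, ys):
    """More complex fallback (assumed): piecewise-linear interpolant."""
    pts = sorted(zip(xs, ys))

    def predict(x):
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return pts[0][1] if x < pts[0][0] else pts[-1][1]

    return predict


def van_der_corput(i, base=2):
    """i-th point of a space-filling sequence in (0, 1)."""
    q, denom = 0.0, 1.0
    while i > 0:
        i, r = divmod(i, base)
        denom *= base
        q += r / denom
    return q


def run(tol=0.01, gain=20.0, max_iter=20):
    xs = [0.0, 1.0]                      # initial design: the two endpoints
    data = [process(x) for x in xs]
    names = list(data[0])
    grid = [k / 100 for k in range(101)]
    truth = [process(x) for x in grid]   # toy-only shortcut for the error score
    models, errors, next_idx = {}, {}, 1
    for _ in range(max_iter):
        for name in names:               # inclusive: all responses share one dataset
            ys = [d[name] for d in data]
            for kind, fit in (("linear", fit_line), ("interp", fit_interp)):
                model = fit(xs, ys)
                err = max(abs(model(x) - t[name]) for x, t in zip(grid, truth))
                if err <= tol:           # keep the simplest model that passes
                    break
            models[name], errors[name] = (kind, model), err
        worst = max(errors.values())
        if worst <= tol:
            break
        n_new = max(1, math.ceil(gain * worst))  # proportional sample request
        new_xs = [van_der_corput(next_idx + j) for j in range(n_new)]
        next_idx += n_new
        xs += new_xs
        data += [process(x) for x in new_xs]     # one call serves every response
    return models, errors, len(xs)


models, errors, n_samples = run()
print({name: kind for name, (kind, _) in models.items()}, n_samples)
```

Because the worst response sets the sample request and every new sample is shared, the easy response keeps its straight-line model while the difficult one converges, and the process is never queried separately per response.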

References

Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., & Smola, A. (2020). AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and Robust Automated Machine Learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (NIPS 2015) (Vol. 28). Curran Associates, Inc.

Gorissen, D., Couckuyt, I., Demeester, P., Dhaene, T., & Crombecq, K. (2010). A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design. Journal of Machine Learning Research, 11, 2051–2055.

Jin, H., Song, Q., & Hu, X. (2019). Auto-Keras: An Efficient Neural Architecture Search System. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1946–1956. https://doi.org/10.1145/3292500.3330648

Komer, B., Bergstra, J., & Eliasmith, C. (2014). Hyperopt-Sklearn: Automatic Hyperparameter Configuration for Scikit-Learn. 32–37. https://doi.org/10.25080/Majora-14bd3278-006

Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2016). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 17, 1–5.

Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2019). Auto-WEKA: Automatic Model Selection and Hyperparameter Optimization in WEKA (pp. 81–95). https://doi.org/10.1007/978-3-030-05318-5_4

LeDell, E., & Poirier, S. (2020). H2O AutoML: Scalable Automatic Machine Learning. In F. Hutter, J. Vanschoren, M. Lindauer, C. Weill, K. Eggensperger, & M. Feurer (Eds.), 7th ICML Workshop on Automated Machine Learning.

Olson, R. S., Bartley, N., Urbanowicz, R. J., & Moore, J. H. (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of the Genetic and Evolutionary Computation Conference 2016, 485–492. https://doi.org/10.1145/2908812.2908918

Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013). Auto-WEKA. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847–855. https://doi.org/10.1145/2487575.2487629