(340o) From Data to Decisions: Development of Surrogate Models for Process Optimization | AIChE

Authors 

Williams, B. - Presenter, Auburn University
Research Interests:

Several areas of pharmaceutical manufacturing, including process design, process synthesis, and supply chain management, rely on complex, high-fidelity simulations and physical experiments that require significant resources in both cost and time. These resource requirements make the processes difficult to model, because the computational and monetary costs of collecting the necessary data may become prohibitive. Optimizing these processes with traditional gradient-based methods may also be impractical, either because gradient information is not readily available or because approximating gradients would require too many expensive simulation evaluations or experiments. To overcome these challenges, cheaper surrogate models that mimic the simulations’ overall behavior can be constructed and used in their place.
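As a minimal illustration of this idea, the Python sketch below fits a quadratic response surface to five evaluations of a stand-in function and then searches the cheap surrogate rather than the original. The function `expensive_simulation`, the sample locations, and the quadratic model form are assumptions made purely for illustration; none of them come from the work described here.

```python
import math

def expensive_simulation(x):
    """Hypothetical stand-in for a costly simulation (assumption, not from the source)."""
    return (x - 1.5) ** 2 + 0.3 * math.sin(3.0 * x)

def solve_linear(a, b):
    """Solve a small dense system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 via the normal equations."""
    basis = [[1.0, x, x * x] for x in xs]
    ata = [[sum(row[i] * row[j] for row in basis) for j in range(3)] for i in range(3)]
    aty = [sum(row[i] * y for row, y in zip(basis, ys)) for i in range(3)]
    return solve_linear(ata, aty)

# Only five "expensive" evaluations are spent on building the surrogate.
samples = [0.0, 0.75, 1.5, 2.25, 3.0]
values = [expensive_simulation(x) for x in samples]
c0, c1, c2 = fit_quadratic(samples, values)

def surrogate(x):
    """Cheap quadratic approximation of the expensive function."""
    return c0 + c1 * x + c2 * x * x

# The surrogate is cheap enough to evaluate on a dense grid.
grid = [i * 0.01 for i in range(301)]
x_best = min(grid, key=surrogate)
print(f"surrogate minimizer: x = {x_best:.2f}")
```

A quadratic is only one possible surrogate form; the same pattern of sampling, fitting, and then searching the fitted model applies to the other techniques discussed in this abstract.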

Surrogate models are used to map input data to output data when the actual relationship between the two is not well understood or is computationally expensive to evaluate. Many techniques have been developed for surrogate modeling; however, a systematic method for selecting suitable techniques for an application remains an open challenge. My research focuses on how best to select both the form of the surrogate model and the data used to construct it. My work has centered on three main projects: (1) data-driven model development for prediction of cardiac differentiation experimental outcomes, (2) selection of surrogate modeling techniques for surface approximation and surrogate-based optimization, and (3) development of a Random Forest-based derivative-free optimization algorithm for solution of constrained gray-box problems.

  1. Data-driven model development for prediction of cardiac differentiation experimental outcomes

Human cardiomyocytes (CMs) have potential for use in cell therapy and high-throughput drug screening. Because of the inability to expand adult CMs, their large-scale production from human pluripotent stem cells (hPSC) has been suggested. However, experiments for optimization of the differentiation process are costly, time-consuming, and highly variable, leading to challenges in developing reliable and consistent protocols for the generation of large CM numbers at high purity. This study examined the ability of data-driven modeling with machine learning for identifying key experimental conditions and predicting final CM content using data collected during hPSC-cardiac differentiation in advanced stirred tank bioreactors (STBR). Through feature selection, we identified process conditions, features, and patterns that are the most influential on and predictive of the CM content at the process endpoint, on differentiation day 10 (dd10). Process-related features were extracted from experimental data collected from 58 differentiation experiments by feature engineering. Models built using random forests and Gaussian process modeling predicted insufficient CM content for a differentiation process with 90% accuracy and precision on dd7 of the protocol and with 85% accuracy and 82% precision at a substantially earlier stage: dd5. These models provide insight into potential key factors affecting hPSC cardiac differentiation to aid in selecting future experimental conditions and can predict the final CM content at earlier process timepoints, providing cost and time savings. This study suggests that data-driven models and machine learning techniques can be employed using existing data for understanding and improving production of a specific cell type, which is potentially applicable to other lineages and critical for realization of their therapeutic applications (Williams et al. 2020).
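The accuracy and precision figures quoted above follow the standard definitions for a binary classifier. The sketch below computes both from true and predicted labels, treating "insufficient CM content" as the positive class (an assumption about how the classes were encoded). The label vectors are made-up placeholders, not data from the study.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for binary labels.

    accuracy  = correct predictions / all predictions
    precision = true positives / (true positives + false positives)
    """
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    accuracy = correct / len(y_true)
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else float("nan")
    return accuracy, precision

# Hypothetical labels for ten experiments: 1 = insufficient CM content, 0 = sufficient.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}")
```

Precision matters here because a false "insufficient" prediction could lead to terminating a run that would have succeeded, whereas accuracy alone does not distinguish between the two kinds of error.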

  2. Selection of surrogate modeling techniques for surface approximation and surrogate-based optimization

The initial goal of this work was to comprehensively investigate and compare the performance of several surrogate modeling techniques for both approximating functional relationships (surface approximation) and surrogate-based optimization, and to link that performance to the characteristics of the data involved in the application. The specific data characteristics investigated in the comparison are the shape of the underlying function being modeled, the number of input dimensions, the sampling method used to generate the data, and the number of sample points in the dataset. The computational experiments revealed that surrogate modeling performance depends on the data characteristics. In general, however, multivariate adaptive regression spline models and Gaussian process regression yielded the most accurate predictions for approximating a surface. Random forests, support vector machines, and Gaussian process regression models most reliably identified the optimum locations and values when used for surrogate-based optimization (Williams and Cremaschi, 2021).
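Three of the data characteristics listed above (input dimension, sampling method, and sample size) are set when the training data are generated. The sketch below shows one common space-filling sampling scheme, Latin hypercube sampling, on the unit hypercube. This abstract does not name the sampling methods that were compared, so the choice of Latin hypercube sampling and the function name `latin_hypercube` are assumptions for illustration.

```python
import random

def latin_hypercube(n_points, n_dims, seed=0):
    """Draw n_points samples in [0, 1)^n_dims with one sample per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # Split the axis into n_points equal strata and visit them in random order.
        strata = list(range(n_points))
        rng.shuffle(strata)
        # Place one point uniformly at random inside each stratum.
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_points)]

# Eight sample points in three input dimensions.
points = latin_hypercube(n_points=8, n_dims=3)
for p in points:
    print(tuple(round(v, 3) for v in p))
```

Varying `n_points`, `n_dims`, and the sampling routine itself is one way to generate datasets with different characteristics for a comparison of this kind.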

Using the results from the comparison, we developed PRESTO (Predictive REcommendation of Surrogate models to approximate and Optimize), a Random Forest classifier-based tool, to recommend the appropriate surrogate modeling technique for a given dataset for surface approximation and surrogate-based optimization, using attributes calculated only from the available input and output values. The attributes include common statistical measures, such as mean and standard deviation, gradient-based attributes, and attributes related to the extrema of the output values. The tool identifies the appropriate surrogate modeling techniques for surface approximation with an accuracy of 91% and a precision of 90%, and for surrogate-based optimization with an accuracy of 98% and a precision of 99%. PRESTO was also tested on a case study of data generated from a high-fidelity process model of the cumene production process. For surface approximation, the performance of PRESTO on the case study data was comparable to its performance on the simulated data on which the tool was trained, with an accuracy and precision of 91% and 90%, respectively.
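To make the idea of attributes calculated only from input and output values concrete, the sketch below computes a few quantities of each kind mentioned above (statistical, gradient-based, and extrema-related) for a one-dimensional dataset. The specific attributes, their names, and the finite-difference slope estimate are illustrative assumptions; they are not the actual PRESTO attribute set.

```python
import statistics

def dataset_attributes(xs, ys):
    """Illustrative attributes of a 1-D dataset (hypothetical, not PRESTO's definitions)."""
    pairs = sorted(zip(xs, ys))
    # Gradient-based: absolute finite-difference slopes between neighboring samples.
    slopes = [abs((y1 - y0) / (x1 - x0))
              for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]) if x1 != x0]
    y_min, y_max = min(ys), max(ys)
    return {
        "y_mean": statistics.fmean(ys),           # statistical
        "y_std": statistics.stdev(ys),            # statistical (sample standard deviation)
        "mean_abs_slope": statistics.fmean(slopes),
        "max_abs_slope": max(slopes),
        "y_range": y_max - y_min,                 # extrema-related
        "x_at_y_min": xs[ys.index(y_min)],        # extrema-related
    }

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
attrs = dataset_attributes(xs, ys)
for name, value in attrs.items():
    print(f"{name}: {value:.4f}")
```

A vector of attributes like this, computed per dataset, is the kind of input a classifier could use to recommend a surrogate modeling technique.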

  3. Development of a Random Forest-based derivative-free optimization algorithm for solution of constrained gray-box problems

The results of the surrogate model comparison study indicate that surrogate models built using random forests (RFs) are exceptionally adept at locating optimum decision variable values. This research activity involves developing an algorithm that uses RF models to reduce the bounds on the decision variables in an optimization problem by using the decision tree thresholds from trained models. When used for optimization, the RF models yield mixed-integer linear programming models. A single iteration of this algorithm consists of sampling, construction of an RF surrogate approximation, global optimization of the constrained approximation problem, and collection of new sampling points; the procedure is repeated until the termination criteria are met. Preliminary results are promising and suggest that RFs are able to drastically reduce the decision variable bounds and the optimization search space for many types of problems.
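The bound-reduction idea can be sketched in a heavily simplified form. The example below is not the algorithm described above: it fits a single greedy regression tree rather than a random forest, it handles no constraints, and it enumerates the leaves directly instead of solving a mixed-integer linear program. It only shows how split thresholds define a smaller box around the region with the lowest predicted objective. The objective function, sample grid, and all function names are hypothetical.

```python
def sum_sq_err(vals):
    """Sum of squared deviations from the mean (used to score a split)."""
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals)

def fit_tree_leaves(points, values, bounds, max_depth=3, min_leaf=2):
    """Fit a greedy regression tree; return its leaves as (leaf_mean, leaf_box) pairs."""
    best = None
    n = len(points)
    if max_depth > 0 and n >= 2 * min_leaf:
        for d in range(len(bounds)):
            order = sorted(range(n), key=lambda i: points[i][d])
            for k in range(min_leaf, n - min_leaf + 1):
                lo, hi = points[order[k - 1]][d], points[order[k]][d]
                if lo == hi:
                    continue  # no threshold fits between equal coordinates
                err = (sum_sq_err([values[i] for i in order[:k]])
                       + sum_sq_err([values[i] for i in order[k:]]))
                if best is None or err < best[0]:
                    best = (err, d, 0.5 * (lo + hi), order[:k], order[k:])
    if best is None:
        return [(sum(values) / n, list(bounds))]
    _, d, threshold, left, right = best
    leaves = []
    for idx, interval in ((left, (bounds[d][0], threshold)),
                          (right, (threshold, bounds[d][1]))):
        child_bounds = list(bounds)
        child_bounds[d] = interval  # the split threshold tightens one variable's bounds
        leaves += fit_tree_leaves([points[i] for i in idx], [values[i] for i in idx],
                                  child_bounds, max_depth - 1, min_leaf)
    return leaves

def reduced_bounds(points, values, bounds):
    """Return the box of the leaf with the lowest predicted (mean) objective value."""
    leaves = fit_tree_leaves(points, values, bounds)
    return min(leaves, key=lambda leaf: leaf[0])[1]

def objective(x, y):
    """Hypothetical objective standing in for a gray-box evaluation."""
    return (x - 0.2) ** 2 + (y - 0.7) ** 2

grid = [i / 5 for i in range(6)]
points = [(x, y) for x in grid for y in grid]
values = [objective(x, y) for x, y in points]
original = [(0.0, 1.0), (0.0, 1.0)]
box = reduced_bounds(points, values, original)
print("reduced bounds:", box)
```

In this simplified form the reduced box is not guaranteed to contain the true optimum, which is one reason an iterative scheme would resample and refit before shrinking the bounds further.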

References:

Williams, B., Lobel, W., Finklea, F., Halloin, C., Ritzenhoff, K., Manstein, F., Mohammadi, S., Hashemi, M., Zweigerdt, R., Lipke, E., Cremaschi, S., 2020. Prediction of Human Induced Pluripotent Stem Cell Cardiac Differentiation Outcome by Multifactorial Process Modeling. Front Bioeng Biotechnol 8, 851. doi: 10.3389/fbioe.2020.00851.

Williams, B., Cremaschi, S., 2021. Selection of surrogate modeling techniques for surface approximation and surrogate-based optimization. Chemical Engineering Research & Design 170, 76-89. doi: 10.1016/j.cherd.2021.03.028.
