(415g) Machine Learning and Multi-Way Method Modelling Methods for Pharmaceutical Process Quality
AIChE Annual Meeting
2021
Computing and Systems Technology Division
Data-Driven and Hybrid Modeling for Decision Making I
Wednesday, November 10, 2021 - 9:30am to 9:45am
A vast amount of historical process data is available for building statistical and data-driven models. Such models rely solely on correlations between inputs and outputs, identified by machine learning algorithms or multivariate methods such as Partial Least Squares (PLS), Artificial Neural Networks (ANN), Support Vector Machines (SVM) and Gaussian Process Regression (GPR). Many of these methods have been applied successfully in process engineering, with ANNs receiving particular interest because of their flexibility in modelling complex, nonlinear input-output relationships.
All of these methods, however, rely on a two-way structure of the dataset, which can be represented as a matrix whose rows and columns denote observations and variables respectively. The data are split into an input matrix X, containing the measured variables, and an output matrix Y, containing the variables to be predicted by the model. This two-way structure can describe a single batch. With multiple batches, however, the data structure becomes three-way, indexed by batch, variable and time. The traditional approach is to unfold the data into two-way matrix form. Unfolding is problematic when batches differ in length or when process events are not synchronized between batches, because the resulting misalignment creates artifacts that degrade the final model. Preprocessing techniques such as Dynamic Time Warping (DTW) can be used to synchronize batch trajectories, but they involve subjective choices and are time consuming.
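The unfolding step and the batch-length problem can be illustrated with a minimal sketch in plain Python. The function name `unfold_batchwise` and the two toy batches are hypothetical and are not taken from the study; the sketch only assumes batch-wise unfolding, in which each batch becomes one row.

```python
def unfold_batchwise(batches):
    """Unfold three-way batch data into a two-way table, one row per batch.

    Each batch is a list of time points, and each time point is a list of
    variable readings. The row layout is
    [t0_v0, t0_v1, ..., t1_v0, t1_v1, ...].
    """
    return [
        [value for time_point in batch for value in time_point]
        for batch in batches
    ]


# Hypothetical toy data: two variables per time point.
batch_a = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # three time points
batch_b = [[1.5, 15.0], [2.5, 25.0]]               # two time points

rows = unfold_batchwise([batch_a, batch_b])
row_lengths = [len(row) for row in rows]

# Rows of unequal length cannot be stacked into one matrix X without
# truncation, padding or time alignment.
is_rectangular = len(set(row_lengths)) == 1
```

Because the two batches have different numbers of time points, the unfolded rows have different lengths, which is the situation that alignment methods such as DTW are meant to resolve before a two-way model can be fitted.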
An alternative approach is to model the dataset with multi-way techniques. Parallel Factor Analysis (PARAFAC) and PARAFAC2 models are widely used in chemistry for processing spectral and chromatographic data. Both models have been shown to be robust to noisy and missing data, and neither requires the data to be unfolded. PARAFAC2 is especially promising because it does not assume that batch trajectories are synchronized and therefore requires no time-warping preprocessing. Multi-way models are powerful tools for analyzing three-way data structures, yet their application to bioprocessing has not been widely explored.
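For reference, the two models are commonly written as below. The notation is an assumption introduced here and does not come from the abstract: i indexes batches, j variables, k time points and r = 1, ..., R components.

```latex
% PARAFAC: every batch shares the same time axis (k = 1, ..., K)
x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} + e_{ijk}

% PARAFAC2: batch i has its own slice X_i of size K_i x J
% and its own time-mode loadings C_i of size K_i x R
\mathbf{X}_i = \mathbf{C}_i \mathbf{D}_i \mathbf{B}^{\mathsf{T}} + \mathbf{E}_i,
\qquad
\mathbf{C}_i^{\mathsf{T}} \mathbf{C}_i = \boldsymbol{\Phi}
\quad \text{for all } i
```

Here D_i is a diagonal matrix holding the batch scores a_{i1}, ..., a_{iR}, and B holds the variable loadings. Because PARAFAC2 constrains only the cross-product of C_i to be constant, each batch may have its own number of time points K_i, which is why no prior alignment of the trajectories is needed.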
In this work, historical data from more than 60 batches of a production process operating at LEO Pharma A/S are analyzed. Different data-driven methods are compared, ranging from multivariate bilinear models to machine learning models and multi-way models. The models aim to predict final product concentration and final batch quality, measured offline at the harvest of each batch. Many modelling iterations are considered, using different preprocessing, validation and variable selection methods. Final model quality is judged by the average prediction error for both product concentration and batch quality. This research aims to produce a consistent data-driven modeling methodology that can be readily and reliably applied to pharmaceutical production to ensure high-quality, consistent batch-to-batch manufacturing.
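The abstract does not state which error metric is used. As a rough sketch, assuming root-mean-square error (RMSE) per response averaged with equal weight, the model score could be computed as follows. The function names and the numbers are hypothetical and are not results from the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def average_prediction_error(observed_by_response, predicted_by_response):
    """Return (mean RMSE over responses, per-response RMSE dictionary)."""
    errors = {
        name: rmse(observed_by_response[name], predicted_by_response[name])
        for name in observed_by_response
    }
    return sum(errors.values()) / len(errors), errors


# Hypothetical, already-scaled validation values for three batches.
observed = {
    "concentration": [0.90, 1.10, 1.00],
    "quality": [0.50, 0.70, 0.60],
}
predicted = {
    "concentration": [1.00, 1.00, 1.00],
    "quality": [0.55, 0.65, 0.60],
}

score, per_response = average_prediction_error(observed, predicted)
```

Averaging is only meaningful when the two responses are on comparable scales, so in practice each response would usually be scaled (for example by its standard deviation) before the errors are combined.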