(243a) Elastic Net with Monte Carlo Sampling for Data-Based Modeling in Biopharmaceutical Manufacturing Facilities | AIChE



Biopharmaceutical manufacturing
involves multiple process steps that can be challenging to model using first
principles. Oftentimes, operating conditions are studied in bench-scale
experiments and then fixed to specific values during full-scale operations.
This procedure limits the opportunity to tune process variables to correct for
the effects of disturbances. Process models have the potential to
increase the flexibility and controllability of biomanufacturing processes.
This work proposes a statistical modeling methodology to predict the outputs of
biopharmaceutical operations. The methodology addresses two challenging
characteristics typical of data collected in the biopharmaceutical
industry: limited data availability and data heterogeneity. Motivated by the
final aim of control, regularization methods, specifically the elastic net, are
combined with sampling techniques similar to the bootstrap to develop
mathematical models that use only a small number of input variables. These techniques
are of particular interest because of their ability to perform model selection
and estimation simultaneously.

Process modeling techniques can be
grouped into two broad categories: first-principles and data-based. This work
focuses on data-based modeling, which is more often applied in biopharmaceutical
manufacturing facilities. Data-based models have been applied to cell culture
characterization [1] [2] [3], quality control [4] [5], process monitoring [3] [6]
[7] [8], and downstream operations [3]. A drawback of current data-based
methods applied in the biopharmaceutical industry is that the models that are
produced are not easily interpretable because they rely on subspaces that do
not have direct physical meaning.

A successful biopharmaceutical model
is defined here as one that achieves three goals: (1) model accuracy, (2) model simplicity,
and (3) model interpretability. These goals must be met using only a small
amount of heterogeneous data, because data from biopharmaceutical manufacturing are
typically both heterogeneous and relatively limited.

One way to achieve these goals is
through the identification of the input variables in the process that exhibit
the largest effects on the output variables. Regularization methods have been
identified as possible approaches for such problems because of their ability to
simultaneously handle input selection and model estimation [9]. A particular
regularization method, the elastic net [10], was identified because of its
ability to handle data with more measured variables than observations. The elastic net
estimates the model parameters by solving the optimization problem

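$$\min_{\beta_0,\,\beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^{T} \beta \right)^2 + \lambda P_\alpha(\beta), \qquad (1)$$

where

$$P_\alpha(\beta) = \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1, \qquad (2)$$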

N is the number of experiments, y_i
is the i-th scalar response, x_i is the p-dimensional
data vector at observation i, λ is a nonnegative
regularization parameter, β_0 is a scalar parameter, β
is a p-dimensional vector of model parameters, and α is on
the interval (0,1]. Using this basis, a five-step methodology, referred to as
the elastic net with Monte Carlo sampling (ENwMC), is proposed.
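To make the roles of λ and α concrete, the following is a minimal sketch of the elastic net objective written with the symbols defined above. It assumes the commonly used parameterization in which the squared-error term is scaled by 1/(2N) and the penalty is (1 − α)/2 times the squared ℓ2-norm plus α times the ℓ1-norm. The function name `elastic_net_objective` and the small data set are hypothetical and serve only as an illustration.

```python
import numpy as np


def elastic_net_objective(beta0, beta, X, y, lam, alpha):
    """Elastic net objective (assumed parameterization, see lead-in).

    X is N-by-p, y has length N, lam >= 0, and alpha lies in (0, 1].
    """
    n = X.shape[0]
    residual = y - beta0 - X @ beta
    fit = residual @ residual / (2.0 * n)            # scaled squared error
    penalty = 0.5 * (1.0 - alpha) * (beta @ beta) + alpha * np.abs(beta).sum()
    return fit + lam * penalty


# Hypothetical toy data: N = 3 observations, p = 2 inputs, fitted exactly
# by beta = (1, 2), so only the penalty term contributes to the objective.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

lasso_value = elastic_net_objective(0.0, beta, X, y, lam=0.1, alpha=1.0)
mixed_value = elastic_net_objective(0.0, beta, X, y, lam=0.1, alpha=0.5)
print(lasso_value, mixed_value)
```

With α = 1 the penalty reduces to the pure ℓ1 (lasso) penalty, while smaller values of α shift weight toward the ℓ2 term.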

The first step in ENwMC is an
application of the elastic net, using leave-one-out cross validation to choose
the value of α. In leave-one-out cross validation, all but one of
the experimental observations are used to fit the model, and the remaining
observation is used to calculate the prediction error. This step is repeated with each
observation held out in turn, and the errors are averaged. The procedure is performed for many
combinations of the regularization parameters α and λ,
where α lies in (0,1] and the range of λ is wide enough to capture the convex behavior of the
error. Because α is the weighting between the ℓ2- and ℓ1-norm penalties and the goal is a
sparse model, a value of α close to 1 is preferable. Therefore α
is chosen based on a tradeoff between model dimensionality and prediction error.
In some cases, this choice is trivial, as a higher value leads to a more
accurate model.
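The grid search described in this step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's `ElasticNet`, whose `alpha` argument plays the role of λ and whose `l1_ratio` argument plays the role of α, and it runs on synthetic data. The grid values and the data-generating model are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (hypothetical): 20 runs, 8 inputs, 2 of them active.
rng = np.random.default_rng(0)
N, p = 20, 8
X = rng.standard_normal((N, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(N)

alpha_grid = [0.25, 0.5, 0.75, 1.0]   # the text's alpha  -> sklearn's l1_ratio
lambda_grid = np.logspace(-3, 0, 10)  # the text's lambda -> sklearn's alpha

loo_error = np.empty((len(alpha_grid), len(lambda_grid)))
n_inputs = np.empty_like(loo_error, dtype=int)
for i, a in enumerate(alpha_grid):
    for j, lam in enumerate(lambda_grid):
        model = ElasticNet(alpha=lam, l1_ratio=a, max_iter=50_000)
        # Each fold holds out one observation; the squared errors are averaged.
        scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                                 scoring="neg_mean_squared_error")
        loo_error[i, j] = -scores.mean()
        # Number of inputs kept when the same model is fit to all the data.
        n_inputs[i, j] = np.count_nonzero(model.fit(X, y).coef_)

for i, a in enumerate(alpha_grid):
    j = int(np.argmin(loo_error[i]))
    print(f"alpha={a:.2f}  best lambda={lambda_grid[j]:.3g}  "
          f"LOO error={loo_error[i, j]:.4f}  inputs kept={n_inputs[i, j]}")
```

Comparing the printed error and the number of retained inputs across the rows gives the tradeoff between dimensionality and prediction error on which the choice of α is based.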

Once the value of α is
fixed, a test for over-fitting is performed using k-fold cross
validation. Using Monte Carlo samples [11], the data are partitioned into a validation
set containing a 1/k proportion of the data and a calibration set
containing the rest. The elastic net, with α fixed, is then
fit and the input variables corresponding to the minimum error are
recorded. This step is repeated many times to converge to the distribution of
models over the possible calibration and validation sets. The frequency with
which each variable is selected is then calculated. In further analysis, only
the variables that were selected above a threshold frequency are considered.
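A minimal sketch of this Monte Carlo resampling step is given below, again using scikit-learn's `ElasticNet` on synthetic data. The number of draws, the λ grid, the fixed value of α, and the 0.8 frequency threshold are all hypothetical choices made for illustration; the abstract does not state the values that were used.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic stand-in data (hypothetical): inputs 1 and 4 drive the output.
rng = np.random.default_rng(1)
N, p, k = 30, 10, 5
X = rng.standard_normal((N, p))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.2 * rng.standard_normal(N)

fixed_alpha = 0.9                    # assumed to come from the previous step
lambda_grid = np.logspace(-2, 0, 8)  # hypothetical lambda grid
n_draws = 100                        # hypothetical number of Monte Carlo draws
counts = np.zeros(p)

for _ in range(n_draws):
    # Random split: 1/k of the data for validation, the rest for calibration.
    perm = rng.permutation(N)
    val, cal = perm[: N // k], perm[N // k:]
    best_err, best_support = np.inf, None
    for lam in lambda_grid:
        model = ElasticNet(alpha=lam, l1_ratio=fixed_alpha, max_iter=50_000)
        model.fit(X[cal], y[cal])
        err = np.mean((y[val] - model.predict(X[val])) ** 2)
        if err < best_err:
            best_err, best_support = err, model.coef_ != 0
    counts += best_support           # record inputs in the minimum-error model

frequency = counts / n_draws
selected = np.flatnonzero(frequency >= 0.8)  # hypothetical threshold
print(np.round(frequency, 2), selected)
```

Inputs that appear in the minimum-error model for most random splits are carried forward, while inputs selected only occasionally are discarded.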

The subset of selected variables is
considered for inclusion in a model using best subset selection. The error of
all possible ordinary least squares models of size m ≤ p, where p is now the number of variables
retained by the threshold, is calculated. A model from this set is then
selected based on the tradeoff between increasing dimensionality and decreasing
error. This tradeoff is easily visualized by plotting the prediction error
against the model dimension to create a Pareto curve. Plots of this type will
often exhibit an "elbow." The elbow corresponds to the model dimensionality
that compromises between model size and prediction error. The result of this
step is the final model.
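Best subset selection over the retained variables can be sketched as an exhaustive search, shown here with NumPy only. As a simplifying assumption, the sketch scores each ordinary least squares model by its sum of squared residuals on the fitting data rather than by a cross-validated prediction error, and the data are synthetic. The helper name `ols_sse` is hypothetical.

```python
from itertools import combinations

import numpy as np

# Synthetic stand-in data (hypothetical): 5 retained inputs, 2 truly active.
rng = np.random.default_rng(2)
N, p = 30, 5
X = rng.standard_normal((N, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.standard_normal(N)


def ols_sse(cols):
    """Sum of squared residuals of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(N), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return float(residual @ residual)


# For each model size m, keep the subset with the smallest error.
best_by_size = {}
for m in range(1, p + 1):
    best_by_size[m] = min((ols_sse(c), c) for c in combinations(range(p), m))

# Error versus model size: the points of the Pareto curve.
for m, (sse, cols) in best_by_size.items():
    print(f"m={m}  inputs={cols}  SSE={sse:.3f}")
```

In this synthetic example the error drops sharply up to m = 2 and changes little afterward, which is the kind of elbow the text describes.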

Figure 1: Simplified
flowsheet of the antibody production process. The bioreactor volume was 2000 L
and was operated in fed-batch mode. The column loadings were typical of an
antibody purification process [12].

The developed methodology is
evaluated on an antibody manufacturing dataset (see Figure 1) and compared to
well-known multivariate analysis techniques for the (bio)pharmaceutical field:
principal component regression (PCR) and partial least squares (PLS). In a
majority of cases, the elastic net technique outperformed PCR and PLS in terms
of error and variance (see Table 1). Averaged over all of the output variables
that were considered, the sum-of-squared errors decreased 27% and the variance
decreased 48% using a regularized model as compared to the latent variable
models. The regularized models have the added benefit of being easily
interpreted in terms of the process variables.

Table 1: Comparisons of scaled
error and variance for PCR, PLS, and ENwMC modeling techniques. The bold number
marks the model with the best performance for each variable.

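| Unit Operation | Output Variable | Error using PCR | Error using PLS | Error using ENwMC | Variance of the prediction using PCR | Variance of the prediction using PLS | Variance of the prediction using ENwMC |
|---|---|---|---|---|---|---|---|
| Bioreactor | G0 Product Quality | 3.79 | 4.25 | 3.41 | 0.146 | 0.148 | 0.087 |
| | Final Titer | 5.38 | 3.35 | 5.40 | 0.281 | 0.287 | 0.178 |
| | DNA | 7.58 | 6.77 | 5.20 | 0.209 | 0.201 | 0.223 |
| | HCP | 4.30 | 2.85 | 1.67 | 0.258 | 0.210 | 0.150 |
| Protein A Column | DNA | 4.26 | 4.33 | 2.71 | 0.151 | 0.143 | 0.095 |
| | HCP | 4.71 | 1.60 | 1.92 | 0.268 | 0.202 | 0.080 |
| | Total Impurity | 9.22 | 7.98 | 2.40 | 0.286 | 0.256 | 0.164 |
| | HMW | 2.08 | 2.54 | 1.11 | 0.117 | 0.092 | 0.045 |
| Cation Exchange Column | HCP | 1.57 | 1.99 | 1.96 | 0.226 | 0.132 | 0.083 |
| | Total Impurity | 7.78 | 5.73 | 7.18 | 0.323 | 0.348 | 0.226 |
| | HMW | 1.73 | 1.45 | 0.32 | 0.058 | 0.063 | 0.010 |
| Anion Exchange Column | HCP | 2.63 | 2.59 | 1.20 | 0.189 | 0.140 | 0.048 |
| | Total Impurity | 4.65 | 1.56 | 2.48 | 0.228 | 0.227 | 0.115 |
| | HMW | 0.54 | 0.24 | 0.23 | 0.067 | 0.050 | 0.007 |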
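As a quick arithmetic check on the statement that ENwMC performs best in a majority of cases, the values in Table 1 can be tallied. The sketch below transcribes the table as (PCR, PLS, ENwMC) triples in table order and counts the output variables for which ENwMC has the lowest value. It assumes that a lower scaled error or variance is better; the variable and function names are hypothetical.

```python
# (PCR, PLS, ENwMC) values transcribed from Table 1, in table order.
error = [
    (3.79, 4.25, 3.41), (5.38, 3.35, 5.40), (7.58, 6.77, 5.20),
    (4.30, 2.85, 1.67), (4.26, 4.33, 2.71), (4.71, 1.60, 1.92),
    (9.22, 7.98, 2.40), (2.08, 2.54, 1.11), (1.57, 1.99, 1.96),
    (7.78, 5.73, 7.18), (1.73, 1.45, 0.32), (2.63, 2.59, 1.20),
    (4.65, 1.56, 2.48), (0.54, 0.24, 0.23),
]
variance = [
    (0.146, 0.148, 0.087), (0.281, 0.287, 0.178), (0.209, 0.201, 0.223),
    (0.258, 0.210, 0.150), (0.151, 0.143, 0.095), (0.268, 0.202, 0.080),
    (0.286, 0.256, 0.164), (0.117, 0.092, 0.045), (0.226, 0.132, 0.083),
    (0.323, 0.348, 0.226), (0.058, 0.063, 0.010), (0.189, 0.140, 0.048),
    (0.228, 0.227, 0.115), (0.067, 0.050, 0.007),
]


def count_wins(rows, column=2):
    """Count rows in which the given column (default: ENwMC) holds the lowest value."""
    return sum(1 for row in rows if row[column] == min(row))


enwmc_error_wins = count_wins(error)
enwmc_variance_wins = count_wins(variance)
print(enwmc_error_wins, enwmc_variance_wins, len(error))
```

With the values as transcribed, ENwMC has the lowest error for 9 of the 14 output variables and the lowest variance for 13 of the 14.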

References

[1] S. M. Mercier, B. Diepenbroek, M. C. F. Dalm and R. H. Wijffels, "Multivariate data analysis as a PAT tool for early bioprocess development data," Journal of Biotechnology, vol. 167, pp. 262-270, 2013.

[2] A. Kirdar, J. Conner, J. Baclaski and A. S. Rathore, "Application of multivariate analysis toward biotech processes: Case study of a cell-culture unit operation," Biotechnology Progress, vol. 23, no. 1, pp. 61-67, 2007.

[3] A. S. Rathore, N. Bhushan and S. Hadpe, "Chemometrics applications in biotech processes: A review," Biotechnology Progress, vol. 27, no. 2, pp. 307-315, 2011.

[4] Y. Roggo, P. Chalus, L. Maurer, C. Lema-Martinez, A. Edmond and N. Jent, "A review of near infrared spectroscopy and chemometrics in pharmaceutical technologies," Journal of Pharmaceutical and Biomedical Analysis, vol. 44, no. 3, pp. 683-700, 2007.

[5] Z. Chen, D. Lovett and J. Morris, "Process analytical technologies and real time process control a review of some spectroscopic issues and challenges," Journal of Process Control, vol. 21, no. 10, pp. 1467-1482, 2011.

[6] E. Read, J. Park, R. Shah, B. S. Riley, K. A. Brorson and A. S. Rathore, "Process analytical technology (PAT) for biopharmaceutical products: Part I concepts and applications," Biotechnology and Bioengineering, vol. 104, no. 2, pp. 276-284, 2010.

[7] E. Read, R. Shah, B. S. Riley, J. T. Park, K. A. Brorson and A. S. Rathore, "Process analytical technology (PAT) for biopharmaceutical products: Part II concepts and applications," Biotechnology and Bioengineering, vol. 105, no. 2, pp. 285-295, 2010.

[8] D. Bonné, M. A. Alvarez and S. B. Jorgensen, "Data driven modeling for monitoring and control of industrial fed-batch cultivations," Industrial & Engineering Chemistry Research, vol. 53, pp. 7365-7381, 2013.

[9] S. Pampuri, A. Schirru, G. Fazio and G. De Nicolao, "Multilevel lasso applied to virtual metrology in semiconductor manufacturing," in 2011 IEEE International Conference on Automation Science and Engineering, Trieste, 2011.

[10] H. Zou and T. Hastie, "Regularization and variable selection via the Elastic Net," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 67, no. 2, pp. 301-320, 2005.

[11] N. Metropolis and S. Ulam, "The Monte Carlo method," Journal of the American Statistical Association, vol. 44, no. 247, pp. 335-341, 1949.

[12] A. Shukla and J. Thommes, "Recent advances in large-scale production of monoclonal antibodies and related proteins," Trends in Biotechnology, vol. 28, no. 5, pp. 253-261, 2010.