(700b) Machine Learning Approaches for the Prediction of Powder Behavior of Pharmaceutical Formulations from Physical Properties of the Active Pharmaceutical Ingredient | AIChE

Authors 

Brown, C., Strathclyde Institute of Pharmacy and Biomedical Sciences
Florence, A. J., University of Strathclyde
Marchal, S., Roche
Piccione, P. M., Process Studies Group, Technology & Engineering, Syngenta

Background

Formulation and process development are necessary to turn a drug substance into a drug product, yet they can be time-consuming and iterative. Artificial intelligence (AI) and machine learning (ML) have emerged as potential tools to accelerate the transition from formulation development to manufacturing, and digital design and data-driven models therefore offer the prospect of shortening these development steps (1). Changes in formulation, such as variation in excipients or their proportions, can affect bulk properties such as powder flowability and therefore risk affecting subsequent manufacturing processes (2). More subtle changes, for instance in physical properties of the active pharmaceutical ingredient (API) such as particle size and shape, can also influence the manufacturability of the drug product (3, 4).

In the present work, we assess ML models that enable rapid decision-making on the viability of pharmaceutical formulations for continuous direct compression (CDC), minimizing the time and experimental resources needed to make this decision. A total of 35 excipients and 17 APIs, combined in 75 pharmaceutical formulations, were analyzed and included in the training dataset of the ML classification models. The models classify the formulations as viable or non-viable according to their flow properties, assessed with respect to suitability for CDC. This abstract presents the preliminary results of an ongoing study.

Methods

To make reliable predictions with ML models, adequate training data are required to cover the range of variability encountered in the property being predicted. The data were therefore first generated experimentally, followed by data curation, model training, and validation.

First, the particle size and shape distributions of the individual components of a series of test formulations were included in the training set as independent variables. Particle shape (convexity, elongation, length, area, perimeter, sphericity, and aspect ratio) was measured by static image analysis on a Morphologi G3 (Malvern), and particle size distribution (D10, D50, D90) was measured by both laser diffraction (MasterSizer 3000, Malvern) and dynamic image analysis (QICPIC, Sympatec). The flow function coefficient (FFc) of the test formulations was included in the training dataset as the dependent variable (response). The FFc was measured with the ring shear tester designed by Dietmar Schulze at a normal consolidation stress of 1000 Pa (the relevant pressure for the CDC manufacturing line under consideration). Formulations were classified as viable if their FFc was greater than 5 and as non-viable if it was smaller than 5, based on equipment requirements determined in-house. Data curation involved filtering and preparing the final dataset for the training step: the Pearson correlation coefficient was calculated between features, and highly correlated features were removed.
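The labelling and correlation-filtering steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code. The function names `label_viable` and `drop_correlated`, the greedy keep-first filtering strategy, and the treatment of FFc exactly equal to 5 as non-viable are assumptions. The 0.9 correlation threshold is the value reported in the Results.

```python
import numpy as np


def label_viable(ffc, threshold=5.0):
    """Return 1 (viable) where FFc exceeds the threshold, else 0 (non-viable)."""
    # Assumption: the abstract does not say how FFc == threshold is handled;
    # here it is counted as non-viable.
    return (np.asarray(ffc, dtype=float) > threshold).astype(int)


def drop_correlated(X, names, threshold=0.9):
    """Greedy filter: keep a feature only if its absolute Pearson correlation
    with every feature kept so far is at or below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]


# Synthetic stand-in descriptors (not data from the study).
rng = np.random.default_rng(0)
d50 = rng.normal(100.0, 20.0, size=50)
d90 = 2.0 * d50 + rng.normal(0.0, 1.0, size=50)  # nearly collinear with d50
convexity = rng.uniform(0.7, 1.0, size=50)
X = np.column_stack([d50, d90, convexity])
names = ["D50", "D90", "convexity"]

X_reduced, kept_names = drop_correlated(X, names)
labels = label_viable([3.2, 5.0, 7.8])

print(kept_names)       # D90 is dropped because it tracks D50 almost exactly
print(labels.tolist())  # [0, 0, 1]
```

Which member of a correlated pair is kept depends on column order in this sketch; the abstract does not state which rule the authors applied.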

For model training, Principal Component Analysis (PCA) and Louvain clustering were first applied as unsupervised algorithms for multivariate data analysis. PCA was used to identify smaller groups of variables and to investigate hidden patterns within the data (5, 6). Louvain clustering was used to detect groups of data in a large dataset so that similarities between the formulations could be analyzed (7). PCA was computed with the decomposition module of the Python package scikit-learn and visualized with the Python library matplotlib. Louvain clustering was computed and visualized in the Orange Data Mining software.
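A PCA step of this kind might look like the sketch below, using scikit-learn's `PCA` class. The data are random placeholders with the same shape as the curated dataset (75 formulations, 17 descriptors). Standardizing the descriptors before PCA and keeping two components are assumptions, since the abstract does not state either choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random placeholder data: 75 formulations x 17 descriptors (not study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(75, 17))
ffc = rng.uniform(1.0, 10.0, size=75)  # placeholder FFc values
viable = ffc > 5                       # class used only to colour a score plot

# Assumption: descriptors are standardized so that no single unit dominates.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)   # one (PC1, PC2) pair per formulation

print(scores.shape)
print(pca.explained_variance_ratio_)
```

The two score columns can then be drawn as a matplotlib scatter plot coloured by `viable` to check visually whether viable and non-viable formulations separate.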

Subsequently, supervised learning algorithms were used to build a model that classifies formulations into viable and non-viable classes (threshold FFc = 5). The first step was the selection of the algorithm, using precision as the metric to compare model effectiveness. Precision is the ratio of true positives to all instances predicted as positive. High precision ensures that formulations predicted as viable are indeed viable and therefore suitable for CDC. This metric was chosen mainly to avoid false positives (non-viable formulations predicted as viable), which would waste time and resources and jeopardize customer confidence. The best-performing algorithm was selected from the following: random forest, support vector machines, gradient boosting, AdaBoost, k-Nearest Neighbors, logistic regression, neural network, and Naïve Bayes. Because the available data were limited, leave-one-out cross-validation was used to compare the performance of the classification algorithms: one formulation was held out for testing at a time, the remaining formulations were used for training, and the process was repeated for every formulation in the dataset. The confusion matrix was then computed, and the precision was calculated from it. Finally, the model was evaluated on an external, experimentally generated test set.
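The evaluation loop described above (leave-one-out predictions, then a confusion matrix, then precision) can be sketched with scikit-learn as follows. The data are synthetic, and the gradient boosting hyperparameters are illustrative assumptions, since the abstract does not report the settings used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in: 40 samples, 5 features, label driven by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(int)  # 1 = viable

# Hyperparameters are illustrative, not taken from the study.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Each sample is predicted by a model trained on all the other samples.
y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())

# Rows are true classes, columns are predicted classes, in the order [0, 1].
tn, fp, fn, tp = confusion_matrix(y, y_pred, labels=[0, 1]).ravel()

# Precision = TP / (TP + FP); undefined if nothing was predicted viable.
precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
print(f"TP={tp} FP={fp} FN={fn} TN={tn} precision={precision:.3f}")
```

To compare several algorithms, the same loop can be repeated with a different estimator in place of `model`, keeping the candidate with the highest precision.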

Results

The dataset included 75 formulations of different compositions and drug loadings. It contains the concentrations of the formulation components, the particle size and shape descriptors of the APIs and of the 35 excipients, and their FFc values. The Pearson correlation coefficient was used to decide which particle size and shape descriptors to include in the training dataset. With a threshold of 0.9, highly correlated descriptors were removed from the analysis, leaving 17 of the initial 35 size and shape descriptors. After this step, the data were curated and ready for training.

The unsupervised learning algorithms did not reveal clusters of data based on the FFc of the formulations and therefore did not provide a useful classification method. Among the classification algorithms, gradient boosting performed best, with the highest precision score (0.896). Feature importance for gradient boosting was calculated from the decrease in precision, indicating that the variables contributing most to the prediction were the FFc of the API and its D50 particle size value measured with the Morphologi G3. Furthermore, an external test was conducted to validate the performance of the model on unseen materials. More data are being generated to complete the dataset.
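The abstract does not name the feature importance method. One common technique that matches "decrease of the precision" is permutation importance scored on precision, sketched below with scikit-learn's `permutation_importance`. The data are synthetic, and the feature names (`api_ffc`, `api_d50`, `noise_1`, `noise_2`) are hypothetical labels chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic data; the feature names are hypothetical, not study variables.
rng = np.random.default_rng(0)
names = ["api_ffc", "api_d50", "noise_1", "noise_2"]
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # label depends on first two columns

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Shuffle one column at a time and record how much precision drops.
# For brevity this scores on the training data; a held-out set is preferable.
result = permutation_importance(
    model, X, y, scoring="precision", n_repeats=10, random_state=0
)

order = np.argsort(result.importances_mean)[::-1]
ranking = [names[i] for i in order]
for i in order:
    print(f"{names[i]:8s} mean precision drop = {result.importances_mean[i]:.3f}")
```

A larger mean drop indicates that the model relies more heavily on that feature when deciding which formulations to predict as viable.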

Implications

The ML classification model's initial precision of 89.6% in predicting the viability of a formulation for CDC, based only on particle size and shape measurements, the FFc of single components, and the composition of the formulation, is a promising indication that this technique could be applied routinely for rapid assessment of the manufacturability of novel formulations. This would allow rapid decision-making in early-stage development, when time and the amount of API are at a premium. The precision of the model is acceptable, but it can be improved with more data from a wider training set of materials. Work is continuing to extend the training set with a wider range of material attributes, compositions, and formulation performance, and to improve the general applicability of the model. The model could ultimately be adapted to different FFc thresholds, depending on the manufacturing process considered. Its extension to other formulation attributes (e.g., wall friction) will be investigated further. The model could also be extended to inform formulation optimization, or even to provide a performance target for particle engineering efforts, thereby reducing the reliance on complex and expensive secondary processing or formulation steps to overcome physical property limitations in raw materials. Furthermore, successful implementation of such validated data-driven models in a user-friendly form would facilitate the transition to Industry 4.0 in pharmaceutical development and manufacture.

References

  1. Yang Y, Ye Z, Su Y, Zhao Q, Li X, Ouyang D. Deep learning for in vitro prediction of pharmaceutical formulations. Acta Pharmaceutica Sinica B. 2019;9(1):177-85.
  2. Leane M, Pitt K, Reynolds G, MCS Working Group. A proposal for a drug product Manufacturing Classification System (MCS) for oral solid dosage forms. Pharmaceutical Development and Technology. 2015;20(1):12-21.
  3. Ticehurst MD, Marziano I. Integration of active pharmaceutical ingredient solid form selection and particle engineering into drug product design. Journal of Pharmacy and Pharmacology. 2015;67(6):782-802.
  4. Leane M, Pitt K, Reynolds GK, Dawson N, Ziegler I, Szepes A, et al. Manufacturing classification system in the real world: factors influencing manufacturing process choices for filed commercial oral solid dosage formulations, case studies from industry and considerations for continuous processing. Pharmaceutical Development and Technology. 2018;23(10):964-77.
  5. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics. 2010;2(4):433-59.
  6. IBM Cloud Education. Unsupervised Learning. IBM Cloud Learn Hub: IBM; 2020 [updated 21/09/2020]. Available from: https://www.ibm.com/cloud/learn/unsupervised-learning#:~:text=Unsupervised%20learning%2C%20also%20known%20as,the%20need%20for%20human%20intervention.
  7. Li H, Liu Z. Multivariate time series clustering based on complex network. Pattern Recognition. 2021;115:107919.