(345q) Classification of Cardiomyocyte Content Differentiated from Human Induced Pluripotent Stem Cells
2021 AIChE Annual Meeting
Computing and Systems Technology Division
Interactive Session: Data and Information Systems
Tuesday, November 9, 2021 - 3:30pm to 5:00pm
To overcome these challenges, we investigated machine learning techniques to identify the critical process parameters that affect the cardiomyocyte (CM) content on day 10 of the differentiation of hydrogel-encapsulated human induced pluripotent stem cells (hiPSCs) in microspheroids. We built a classification model to predict whether the CM content on day 10 would be sufficient. The day-10 CM content is a critical quality attribute: it must be high enough to justify continuing production toward heart tissue maturation. We investigated two approaches for building the classification model, and this presentation discusses each method and its results in detail. The first approach uses data collected from bio-process experiments as the inputs to the classification model. In the second approach, the inputs are phase-contrast images of the microspheroids taken on day 5 of differentiation.
The first approach combines three machine learning steps: feature engineering, feature selection, and classification (Figure 1). Feature engineering derives new features from the existing ones in order to incorporate expert knowledge. Feature selection identifies combinations of features that could form a strong set of predictors [4]. Finally, the classifiers are trained on the selected features. Three data-driven models were trained as classifiers: Random Forest (RF) [5], Gaussian Process (GP) [6], and Support Vector Machine (SVM) [7]. The bio-process features, which describe the experimental conditions, include the initial cell number, cell concentration, post-freeze passage of the cells, size and axial ratio of the microspheroids, differentiation media, CHIR molecule concentration, and PEG-fibrinogen concentration. Nine new features were derived from the bio-process features by feature engineering: the surface and volume of the microspheroids, the surface-to-volume ratio, the CHIR molecule concentration per surface, the CHIR molecule concentration per volume, the ratio of CHIR molecule concentration to surface per volume, and the inverses of the ratios. The differentiation media, a categorical feature, was converted to numerical variables using one-hot encoding [8].
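The feature engineering and one-hot encoding steps described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each microspheroid is a perfect sphere described by a single diameter (the abstract also mentions an axial ratio, which is ignored here). The function names, argument names, and media labels are hypothetical, and the inverse features are omitted because the abstract does not specify which ratios were inverted.

```python
import math

def engineer_features(diameter, chir_conc):
    """Derive geometric and CHIR-normalized features for one microspheroid.

    Assumption: the microspheroid is a perfect sphere of the given diameter.
    Units are whatever the caller uses consistently.
    """
    radius = diameter / 2.0
    surface = 4.0 * math.pi * radius ** 2
    volume = (4.0 / 3.0) * math.pi * radius ** 3
    surface_to_volume = surface / volume
    return {
        "surface": surface,
        "volume": volume,
        "surface_to_volume": surface_to_volume,
        "chir_per_surface": chir_conc / surface,
        "chir_per_volume": chir_conc / volume,
        # CHIR concentration divided by the surface-to-volume ratio
        "chir_per_surface_to_volume": chir_conc / surface_to_volume,
    }

def one_hot(value, categories):
    """Encode a categorical value as one 0/1 indicator per category."""
    return {f"media_{c}": 1 if value == c else 0 for c in categories}

# Hypothetical example: diameter 2.0, CHIR concentration 6.0, two media types
features = engineer_features(diameter=2.0, chir_conc=6.0)
features.update(one_hot("B", ["A", "B"]))
```

For a sphere the surface-to-volume ratio reduces to 3 divided by the radius, which is a convenient sanity check on the derived values.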
The feature selection methods used in this study were a filter method [9] followed by principal component analysis (PCA) [10], embedded methods [11,12], or wrapper methods [13]. In the filter method, when features had correlations above 0.85 with one another, only one of them was kept, yielding the filtered feature set. In PCA, the principal components (PCs) describing 90% of the input data variance were selected for building the classification model. The built-in functions of RF and GP modeling served as the embedded feature selection methods for choosing the features with a significant impact on the prediction. In wrapper methods, different combinations of features are used to build the classifier, and the set with the best classification performance is selected as the final input feature set [14]. We investigated forward selection, backward elimination, and bidirectional methods [15,16] as wrapper methods. In forward selection, features are added to the classifier model step by step, and the model with the best performance is selected. Backward elimination works in the opposite direction: it starts from the full feature set and removes features step by step. The bidirectional method combines the two. All three methods were applied with the filtered features and with the PCs as inputs. The performance of the models was compared based on the Matthews correlation coefficient (MCC) [17] and accuracy [18].
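A minimal sketch of the correlation filter and of greedy forward selection is given below. It is not the study's implementation: it assumes the filter uses the absolute Pearson correlation and keeps the first feature encountered in each correlated group, and it takes the scoring function as an argument. In practice the score would be something like a cross-validated MCC of the classifier; here a toy score stands in for it. All names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def correlation_filter(columns, threshold=0.85):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def forward_selection(features, score):
    """Greedily add the feature that most improves score(subset); stop when none does."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score <= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best

# Toy data: "surface" is nearly proportional to "volume", so it is filtered out.
columns = {
    "volume": [1.0, 2.0, 3.0, 4.0],
    "surface": [2.0, 4.1, 6.0, 8.2],
    "chir": [3.0, 1.0, 4.0, 2.0],
}
kept = correlation_filter(columns)

# Toy score: rewards two "useful" features and penalizes subset size.
useful = {"a", "c"}
def toy_score(subset):
    return len(useful & set(subset)) - 0.1 * len(subset)

selected, best = forward_selection(["a", "b", "c"], toy_score)
```

Backward elimination follows the same pattern in reverse, starting from all features and removing the one whose removal most improves the score.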
In the second approach, images were used as the input for building the classification model. Discussions with our experimental collaborators suggested that the cell images taken on day 5 of differentiation (Figure 2) are indicative of the final CM content on day 10. We investigated whether machine learning techniques could capture this information and compared the resulting models with those trained on the bio-process features. For preprocessing, the images were augmented to increase the number of available data points: each image was both flipped and rotated by 180°. The histogram of oriented gradients (HOG) [19] was added as an additional feature. PCA was used as the feature selection method, and the PCs describing 95% of the input variance were chosen. The classifier was an SVM. The performance metrics for evaluating the models were accuracy and MCC.
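The augmentation step can be sketched as follows. The abstract does not state the flip axis or whether the flipped and rotated copies were kept as separate samples, so this sketch assumes a horizontal flip and returns the original plus two augmented copies per image. Images are represented as plain nested lists of pixel values to keep the example dependency-free; the function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def rotate_180(image):
    """Rotate an image by 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in image[::-1]]

def augment(images):
    """Return each original image followed by its flipped and rotated copies."""
    augmented = []
    for image in images:
        augmented.extend([image, flip_horizontal(image), rotate_180(image)])
    return augmented

# Tiny 2x2 "image" to make the transformations easy to check by eye
example = [[1, 2],
           [3, 4]]
augmented = augment([example])
```

Under these assumptions the dataset grows threefold, and the label of each augmented copy is inherited from its source image.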
The 86 bio-process data points and 301 images used for modeling were collected from experiments in which the CMs were produced by single-step cell handling in a 3D microenvironment. In this scaffold-based approach, the hiPSCs were encapsulated in a PEG-fibrinogen extracellular matrix using a novel and cost-effective microfluidic system [20] (Figure 1). The selected features were used to construct models that classify the CM content on day 10 of differentiation into two groups: "sufficient" (CM content > 65%) and "insufficient" (CM content ≤ 65%).
The best classifier trained on the bio-process features was the GP model with features chosen by forward selection on the PCs. This model had an accuracy of 75% and an MCC of 0.46. The PCs chosen by forward selection were not strong descriptors of the input data variance, which suggests that more cell growth-related features may be required to improve the classifiers. The best model using images as inputs had an accuracy of 74% and an MCC of 0.49, comparable to the results obtained with the bio-process parameters. Current work focuses on combining the bio-process data and the image data to construct an ensemble model with higher accuracy and MCC.
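For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and MCC follow from the counts of a binary confusion matrix, with the "sufficient" class defined by the 65% CM-content threshold. The labels and predictions are made-up toy values, not the study's data, and the function names are hypothetical.

```python
import math

def label_cm_content(cm_percent, threshold=65.0):
    """Return 1 ("sufficient") if CM content exceeds the threshold, else 0."""
    return 1 if cm_percent > threshold else 0

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy day-10 CM contents (percent) and toy classifier predictions
cm_contents = [80.0, 72.0, 66.0, 40.0, 55.0, 65.0, 90.0, 10.0]
y_true = [label_cm_content(c) for c in cm_contents]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
score = mcc(*counts)
```

Unlike accuracy, MCC uses all four confusion-matrix counts, so it remains informative when the two classes are imbalanced.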
References
1. Murphy SL, Xu J, Kochanek KD, Arias E. Mortality in the United States, 2017. NCHS Data Brief. 2018;(328):1-8.
2. Kropp C, Kempf H, Halloin C, et al. Impact of Feeding Strategies on the Scalable Expansion of Human Pluripotent Stem Cells in Single-Use Stirred Tank Bioreactors. Stem Cells Transl Med. 2016;5(10):1289-1301. doi:10.5966/sctm.2015-0253
3. Halloin C, Schwanke K, Löbel W, et al. Continuous WNT Control Enables Advanced hPSC Cardiac Processing and Prognostic Surface Marker Identification in Chemically Defined Suspension Culture. Stem Cell Reports. 2019;13(2):366-379. doi:10.1016/j.stemcr.2019.06.004
4. Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell. 1997;97(1-2):245-271.
5. Breiman L. Random Forests. Mach Learn. 2001;45(1):5-32.
6. Williams CKI, Rasmussen CE. Gaussian Processes for Machine Learning. Vol 2. Cambridge, MA: MIT Press; 2006.
7. Drucker H, Shahrary B, Gibbon DC. Support vector machines: Relevance feedback and information retrieval. Inf Process Manag. 2002;38(3):305-323. doi:10.1016/S0306-4573(01)00037-1
8. Potdar K, S. T, D. C. A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers. Int J Comput Appl. 2017;175(4):7-9. doi:10.5120/ijca2017915495
9. Soper HE, Young AW, Cave BM, Lee A, Pearson K. On the Distribution of the Correlation Coefficient in Small Samples. Appendix II to the Papers of "Student" and R. A. Fisher. Biometrika. 1917;11(4):328. doi:10.2307/2331830
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417.
11. Wah YB, Ibrahim N, Hamid HA, Abdul-Rahman S, Fong S. Feature selection methods: Case of filter and wrapper approaches for maximising classification accuracy. Pertanika J Sci Technol. 2018;26(1):329-340.
12. Naqvi S. A Hybrid Filter-Wrapper Approach for Feature Selection. 2011.
13. Kohavi R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. 1995.
14. Das S. Filters, wrappers and a boosting-based hybrid for feature selection. In: ICML. Vol 1. 2001:74-81.
15. Jović A, Brkić K, Bogunović N. A review of feature selection methods with applications. In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 2015:1200-1205.
16. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3(Mar):1157-1182.
17. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta - Protein Struct. 1975;405(2):442-451. doi:10.1016/0005-2795(75)90109-9
18. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag. 2009;45(4):427-437. doi:10.1016/j.ipm.2009.03.002
19. Freeman WT, Roth M. Orientation Histograms for Hand Gesture Recognition. Gesture. 1994.
20. Seeto WJ, Tian Y, Pradhan S, Kerscher P, Lipke EA. Photocrosslinked Microspheres: Rapid Production of Cell-Laden Microspheres Using a Flexible Microfluidic Encapsulation Platform (Small 47/2019). Small. 2019;15(47):1970254. doi:10.1002/smll.201970254