(108a) Virtual High-Throughput Screening of Vapor-Deposited Amphiphilic Polymers for Biofilm Reduction with Machine/Deep Learning
AIChE Annual Meeting
2021
Topical Conference: Applications of Data Science to Molecules and Materials
Applications of Data Science to High Throughput Experimentation
Monday, November 8, 2021 - 12:30pm to 12:45pm
To address these issues, artificial intelligence (AI) could assist or even guide material development, making discovery both more efficient and less expensive. This abstract presents a robust machine/deep learning model trained across seven data sources and a potential path to rapid, AI-based screening of material candidates, a capability that other computational models have not yet generalized to high-throughput screening of amphiphilic copolymers.
Using RDKit, we constructed a synthetic feature library from the experimental antifouling and antibiofilm performance (quantified as logarithmic fluorescence intensity) of 2,435 unique mixed copolymers against Pseudomonas aeruginosa PAO1 (PA) suspensions. We denote each copolymer's quantified performance as Log FPA. After data cleaning, we randomly split the database with stratification (an 80:20 train-test split) so that the training and test sets followed similar distributions, and we normalized the training set. We then applied the same stratified splitting to the normalized training set to obtain a sub-training set and a validation set. With these two subsets, we built an autoencoder for dimensionality reduction, which retained as much information as possible in a lower dimension and significantly reduced the noise in the datasets.
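The splitting and normalization steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the abstract does not say how the continuous Log FPA target was stratified or which normalization was applied, so the equal-width binning, the min-max scaling, the 80:20 ratio for the second split, and all function names (`stratified_split`, `fit_min_max`, `apply_min_max`) are hypothetical. The data are random placeholders, not the experimental library.

```python
import random
from collections import defaultdict


def stratified_split(targets, test_fraction=0.2, n_bins=5, seed=0):
    """Split indices so each target bin contributes about test_fraction to the test set.

    Assumption: the continuous target is stratified with equal-width bins.
    """
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant target
    bins = defaultdict(list)
    for i, y in enumerate(targets):
        bins[min(int((y - lo) / width), n_bins - 1)].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for b in sorted(bins):
        members = bins[b]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_min_max(rows):
    """Per-feature (min, max), computed on the training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, stats):
    """Scale rows with previously fitted statistics (assumed min-max scaling)."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, stats)]
        for row in rows
    ]


# Random placeholder data with the same size and target range as in the abstract.
rng = random.Random(42)
log_f = [rng.uniform(5.0, 8.0) for _ in range(2435)]
features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in log_f]

# First split: 80:20 train-test, then normalize with training statistics.
train_idx, test_idx = stratified_split(log_f)
stats = fit_min_max([features[i] for i in train_idx])
x_train = apply_min_max([features[i] for i in train_idx], stats)
x_test = apply_min_max([features[i] for i in test_idx], stats)

# Second split of the training set into sub-training and validation subsets.
sub_rel, val_rel = stratified_split([log_f[i] for i in train_idx], seed=1)
```

Fitting the scaling statistics on the training rows alone keeps information from the test set out of the preprocessing, which is why the test features are transformed with `stats` rather than refitted.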
We then trained a support vector regressor (SVR) with a radial basis function kernel using 5-fold cross-validation (CV), with root mean squared error (RMSE) and coefficient of determination (R2) as metrics. In parallel, we built nine databases of unseen amphiphilic copolymers, one for each of nine compositions (10:90 to 90:10, in increments of 10 on the first number of the ratio), from 61 hydrophobic and five hydrophilic moieties. The moieties were assigned to these two groups by their Log P values: a hydrophobic moiety has a Log P greater than 2, and a hydrophilic moiety has a Log P less than 0. We also built two databases containing only hydrophobic polymers or only hydrophilic polymers. After the model passed the final evaluation, we used it to screen the nine copolymer databases and the two polymer databases and identified promising materials for experimental validation.
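The Log P labeling rule and the construction of the nine composition databases can be illustrated with a short sketch. It rests on assumptions: the moiety names are placeholders rather than the actual monomers, the first number of each ratio is taken to refer to the hydrophobic component, and the helper names `label_moiety` and `enumerate_copolymers` are hypothetical.

```python
from itertools import product


def label_moiety(log_p):
    """Apply the Log P thresholds from the text; values in [0, 2] get no label."""
    if log_p > 2:
        return "hydrophobic"
    if log_p < 0:
        return "hydrophilic"
    return None


def enumerate_copolymers(hydrophobic, hydrophilic, step=10):
    """Build one database per composition, from 10:90 to 90:10.

    Assumption: the first number of each ratio is the hydrophobic fraction.
    """
    ratios = [(r, 100 - r) for r in range(step, 100, step)]
    return {
        ratio: [(a, b, ratio) for a, b in product(hydrophobic, hydrophilic)]
        for ratio in ratios
    }


# Placeholder names standing in for the 61 hydrophobic and 5 hydrophilic moieties.
hydrophobic = [f"HB_{i}" for i in range(1, 62)]
hydrophilic = [f"HL_{i}" for i in range(1, 6)]

databases = enumerate_copolymers(hydrophobic, hydrophilic)
total_copolymers = sum(len(entries) for entries in databases.values())
print(len(databases), total_copolymers)
```

Each database holds 61 x 5 = 305 candidate pairs, so the nine compositions together give 9 x 305 = 2,745 copolymers.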
With a suitably designed architecture of encoding and decoding layers, the autoencoder achieved a reconstruction error of 3.50 × 10^-4 on the sub-training set and 4.05 × 10^-4 on the validation set during dimensionality reduction. In 5-fold CV, for a Log FPA range of 5 to 8, the SVR model had an average RMSE of 0.363 (+/- 0.007) on the sub-training sets and 0.395 (+/- 0.028) on the validation sets; the corresponding average R2 values were 0.642 (+/- 0.011) and 0.576 (+/- 0.045). In the final model evaluation, the trained model achieved an RMSE of 0.360 and an R2 of 0.649 on the training set, and an RMSE of 0.395 and an R2 of 0.571 on the test set. Figure 1 in the attached image file visualizes the final model evaluation results. Finally, over the entire database, the model achieved an RMSE of 0.354 and an R2 of 0.660.
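For reference, the two reported metrics and the per-fold summary can be computed as in the sketch below. The numbers in the example are made-up toy values, not results from this study, and the use of the sample standard deviation for the "+/-" spread is an assumption, since the abstract does not state which spread measure was reported.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def summarize_folds(scores):
    """Mean and sample standard deviation across CV folds (assumed spread measure)."""
    return statistics.mean(scores), statistics.stdev(scores)


# Toy values for illustration only.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
toy_rmse = rmse(y_true, y_pred)
toy_r2 = r2_score(y_true, y_pred)
fold_mean, fold_std = summarize_folds([0.1, 0.2, 0.3])
```

An RMSE is expressed in the units of the target, so it is most informative when read against the target range, here a Log FPA span of roughly 5 to 8.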
Across the 11 databases, the trained model screened 2,745 unseen amphiphilic copolymers, 61 hydrophobic polymers, and five hydrophilic polymers. Based on the ranked performances and the experimental requirements of vapor deposition (e.g., volatility of the monomers), we selected the copolymer of hexafluoroisopropyl methacrylate (HFiPMA) and 2-hydroxyethyl methacrylate (HEMA) at a composition of 40:60 for experimental validation. We chose this composition because it gave this copolymer its optimal predicted Log FPA. Figure 2 in the attached image file shows the predicted Log FPA for this copolymer at all screened compositions.
Our results present a model that is robust across multiple data sources and a promising material, demonstrating a material screening procedure that can guide novel material discovery. Going forward, our team aims to interpret what the model learned and to apply less data-dependent methods, such as reinforcement learning, to accelerate material discovery.