(108a) Virtual High-Throughput Screening of Vapor-Deposited Amphiphilic Polymers for Biofilm Reduction with Machine/Deep Learning

Conference

AIChE Annual Meeting

Year

2021

Proceeding

2021 Annual Meeting

Group

Topical Conference: Applications of Data Science to Molecules and Materials

Session

Applications of Data Science to High Throughput Experimentation

Time

Monday, November 8, 2021 - 12:30pm to 12:45pm

Authors

Feng, Z. - Presenter, Cornell University

Cheng, Y., Cornell University

Hook, A. L., The University of Nottingham

Yang, R., Cornell University

Varner, J. D., Cornell University

Over the past decade, researchers have identified amphiphilic copolymers as promising materials for suppressing biofilm formation on separation membrane surfaces. However, design guidelines for these copolymers remain unclear in many respects (e.g., the optimal pairs and compositions of hydrophobic and hydrophilic moieties). Conventionally, one screens material candidates through many experiments until a suitable material emerges. This approach can reach the goal, but exploring and improving each test sample takes a long time and requires costly chemicals and equipment.

To address these issues, artificial intelligence (AI) could assist or even guide material development, making discovery more efficient and less expensive. This abstract presents a robust machine/deep learning model trained across seven data sources and a potential path to rapidly screen material candidates with AI. Other computational models have not yet been generalized to high-throughput screening of amphiphilic copolymers.

Using RDKit, we constructed a synthetic feature library based on the experimental antifouling and antibiofilm performances (quantified by logarithmic fluorescence intensities) of 2,435 unique mixed copolymers against Pseudomonas aeruginosa PAO1 (PA) suspensions. We label each copolymer's quantified performance as Log FPA. After data cleaning, we randomly split the database with stratification (an 80:20 train-test split) so that the training and test sets followed similar distributions, and we normalized the training set. We then split the normalized training set in the same way into a sub-training set and a validation set. With these two subsets, we built an autoencoder for dimensionality reduction, which retained as much information as possible in a lower dimension and significantly reduced the noise in the datasets.
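The split-and-normalize step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes stratification is done by binning the continuous Log FPA target into equal-width bins, that "normalized" means z-score standardization, and that the training-set statistics are reused for the held-out set. The function names, the number of bins, and the synthetic data are all hypothetical.

```python
import random
import statistics


def stratified_split(targets, test_fraction=0.2, n_bins=5, seed=0):
    """Split indices so each equal-width bin of a continuous target
    contributes roughly `test_fraction` of its members to the test set."""
    rng = random.Random(seed)
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant target
    bins = {}
    for i, y in enumerate(targets):
        b = min(int((y - lo) / width), n_bins - 1)  # clamp the maximum value
        bins.setdefault(b, []).append(i)
    train_idx, test_idx = [], []
    for members in bins.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_standardizer(rows):
    """Column-wise mean and standard deviation from the training rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # avoid divide-by-zero
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Synthetic stand-in data (NOT the experimental dataset).
data_rng = random.Random(42)
log_f = [data_rng.uniform(5.0, 8.0) for _ in range(1000)]
features = [[data_rng.gauss(0, 1), data_rng.gauss(10, 3)] for _ in range(1000)]

train_idx, test_idx = stratified_split(log_f, test_fraction=0.2)
means, stds = fit_standardizer([features[i] for i in train_idx])
x_train = standardize([features[i] for i in train_idx], means, stds)
x_test = standardize([features[i] for i in test_idx], means, stds)
```

Calling `stratified_split` a second time on the training targets would produce the sub-training and validation sets in the same manner. Fitting the statistics on the training rows alone keeps information from the test set out of the model inputs.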

Next, we trained a support vector regressor (SVR) with a radial basis function kernel in 5-fold cross-validation (CV), using root mean squared error (RMSE) and coefficient of determination (R2) as metrics. In parallel, we developed nine databases of unseen amphiphilic copolymers, one for each of nine compositions (from 10:90 to 90:10, in increments of 10 on the numerator), by combining 61 hydrophobic and five hydrophilic moieties. We labeled the two groups of moieties by their Log P values: a hydrophobic moiety has a Log P greater than 2, and a hydrophilic moiety has a Log P less than 0. We also built two databases that contained only hydrophobic polymers or only hydrophilic polymers. Once the trained model passed the final evaluation, we used it to screen the nine copolymer databases and the two polymer databases and identified promising materials for experimental validation.
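As a rough illustration of how the screening libraries are sized, the sketch below enumerates every pairing of 61 hydrophobic and five hydrophilic moieties at nine compositions. The moiety names are placeholders, and reading each ratio as hydrophobic:hydrophilic is an assumption; the Log P thresholds are the ones stated above.

```python
from itertools import product

N_HYDROPHOBIC = 61  # count stated in the abstract
N_HYDROPHILIC = 5   # count stated in the abstract


def classify_by_log_p(log_p):
    """Label a moiety using the Log P thresholds given in the abstract."""
    if log_p > 2:
        return "hydrophobic"
    if log_p < 0:
        return "hydrophilic"
    return "unlabeled"  # between 0 and 2: belongs to neither group


# Placeholder identifiers; the real moieties are not listed in the abstract.
hydrophobic = [f"hydrophobic_{i:02d}" for i in range(1, N_HYDROPHOBIC + 1)]
hydrophilic = [f"hydrophilic_{i:02d}" for i in range(1, N_HYDROPHILIC + 1)]

# Nine compositions, 10:90 through 90:10 in steps of 10
# (assumed here to mean hydrophobic:hydrophilic).
ratios = [(h, 100 - h) for h in range(10, 100, 10)]

# One database per composition, each holding every hydrophobic/hydrophilic pair.
libraries = {
    ratio: [(a, b, ratio) for a, b in product(hydrophobic, hydrophilic)]
    for ratio in ratios
}

total_copolymers = sum(len(entries) for entries in libraries.values())
print(len(libraries), "databases,", total_copolymers, "candidate copolymers")
```

Each database holds 61 x 5 = 305 pairs, so the nine compositions together give 9 x 305 = 2,745 candidates.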

With a suitably designed architecture for the encoding and decoding layers, the autoencoder achieved a reconstruction error of 3.50 * 10e-4 on the sub-training set and 4.05 * 10e-4 on the validation set during the dimensionality reduction step. In 5-fold CV, for a Log FPA range of 5 to 8, the SVR achieved an average RMSE of 0.363 (+/- 0.007) on the sub-training sets and 0.395 (+/- 0.028) on the validation sets. The corresponding average R2 values were 0.642 (+/- 0.011) and 0.576 (+/- 0.045). In the final model evaluation, the trained model achieved an RMSE of 0.360 on the training set and 0.395 on the test set, with R2 values of 0.649 and 0.571, respectively. Figure 1 in the attached image file visualizes the final model evaluation results. Finally, the model's overall performance on the entire database was an RMSE of 0.354 and an R2 of 0.660.
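For reference, the two metrics reported above can be computed as in the following sketch. The helper names and the toy values are hypothetical, and treating the "+/-" figures as the sample standard deviation across folds is an assumption, since the abstract does not state how the spread was computed.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(statistics.fmean(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def summarize_folds(scores):
    """Mean and sample standard deviation of a per-fold metric
    (assumed interpretation of the reported '+/-' values)."""
    return statistics.fmean(scores), statistics.stdev(scores)


# Toy values for illustration only (NOT results from the study).
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
example_rmse = rmse(y_true, y_pred)
example_r2 = r2_score(y_true, y_pred)
```

An R2 of 1 corresponds to perfect prediction, while a model that always predicts the mean of the observed values scores 0. RMSE is in the same units as Log FPA, which is why the abstract quotes it alongside the 5-to-8 range of the target.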

Across the 11 databases, the trained model screened 2,745 unseen amphiphilic copolymers, 61 hydrophobic polymers, and five hydrophilic polymers. Based on the ranked performances and the experimental requirements of vapor deposition (e.g., volatility of the monomers), we selected the copolymer of hexafluoroisopropyl methacrylate (HFiPMA) and 2-hydroxyethyl methacrylate (HEMA) at a composition of 40:60 for experimental validation. We chose this composition because it gave the optimal predicted Log FPA for this copolymer. Figure 2 in the attached image file presents the predicted Log FPA for this copolymer at all screened compositions.

Our results provide a robust model across multiple data sources and a promising candidate material, demonstrating a screening procedure that can guide the discovery of novel materials. Going forward, we plan to interpret what the model has learned and to apply less data-dependent methods, such as reinforcement learning, to accelerate material discovery.

Topics

Computational Molecular Engineering

Materials
