(634e) Systematic Chem-Informatics and Machine Learning Studies for Gas Permeability and Selectivity in Polymers
AIChE Annual Meeting
2021
Sustainable Engineering Forum
CO2 Capture for Power Generation
Thursday, November 11, 2021 - 4:30pm to 4:45pm
We extended the simplified molecular-input line-entry system (SMILES), originally designed for small molecules, to encode polymer repeat units. The extension uses the [*] ghost atom with different charges to mark the head and tail atoms of the repeat unit, for both simple and ladder homopolymers. Multiple-mers can be generated from this extended SMILES scheme.
Starting from the Membrane Society of Australasia (MSA) polymer database [1], which reports permeabilities for different gases, we built an in-house polymer database by carefully checking the data and references for accuracy and adding the SMILES for the polymer repeat units of homopolymers, random polymers, and polymer blends. We added more than 100 new datasets as well as other polymer properties, such as glass transition temperature, polymer density, and free volume fraction. The in-house database contains 1,674 sets of data, each corresponding to one polymer. Of these, 1,210 sets correspond to homopolymers, which contain only 807 unique polymer repeat units.
Five fingerprints (RDKfingerprint, MACCS keys, AtomPair fingerprint, Torsional fingerprint, and Morgan fingerprint) were tested along with 189 different 2D descriptors. Although the RDKfingerprint has been used by another research group [2] to develop ML models that predict gas permeability in polymers, we found that it fails to distinguish 2.2% of the 807 unique repeat units: for some distinct polymer repeat unit structures, the RDKfingerprint gives identical fingerprint values, which is unexpected. To fix this problem, we combined the RDKfingerprint with the 189 2D descriptors, and the combined representation was able to discern the different polymer structures.
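The collision check described above can be illustrated with a minimal sketch. It is not the authors' code: the polymer names, bit vectors, and descriptor values are hypothetical toy data, and the helper names (`find_collisions`, `collision_fraction`, `combine`) are introduced here for illustration only. In practice the vectors would come from a cheminformatics toolkit rather than being typed by hand.

```python
from collections import defaultdict


def find_collisions(features):
    """Group structure names that share an identical feature vector."""
    groups = defaultdict(list)
    for name, vector in features.items():
        groups[tuple(vector)].append(name)
    # Only groups with more than one member are collisions.
    return [sorted(names) for names in groups.values() if len(names) > 1]


def collision_fraction(features):
    """Fraction of structures whose vector is shared with another structure."""
    collided = sum(len(group) for group in find_collisions(features))
    return collided / len(features)


def combine(fingerprints, descriptors):
    """Append each structure's descriptor values to its fingerprint bits."""
    return {
        name: tuple(fingerprints[name]) + tuple(descriptors[name])
        for name in fingerprints
    }


# Hypothetical toy data: polymer_A and polymer_B get the same fingerprint.
fingerprints = {
    "polymer_A": (1, 0, 1, 1),
    "polymer_B": (1, 0, 1, 1),
    "polymer_C": (0, 1, 1, 0),
}
# Hypothetical descriptor values (for example, a mass and a surface area).
descriptors = {
    "polymer_A": (104.15, 0.0),
    "polymer_B": (118.18, 0.0),
    "polymer_C": (86.09, 26.3),
}

print(find_collisions(fingerprints))
print(find_collisions(combine(fingerprints, descriptors)))
```

With the toy data, the fingerprint alone cannot separate polymer_A from polymer_B, while the combined vector has no collisions because the descriptor values differ.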
In addition to comparing how well the different fingerprints and descriptors describe polymer repeat units, we will show their ML performance using a Gaussian process regression (GPR) algorithm and different numbers of repeat units. Our work shows that the RDKfingerprint combined with the 189 2D descriptors gives the best ML performance at a polymer repeat unit length of 10. Furthermore, shuffling of the data was found to significantly affect the ML performance. For example, using the same 1,071 sets of data for CO2 permeability with 70% training and 30% test data, shuffling the data (which changes the assignment to training and test sets) could give R² values (coefficient of determination) for the test set between 0.84 and 0.91, although the R² values for the training set are approximately 0.98. We will show strategies to alleviate this problem. We will also show our preliminary results of designing polymers by using the iQSPR method [3,4].
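The sensitivity of the test-set R² to shuffling can be sketched as follows. This is an illustration under stated assumptions, not the authors' workflow: the data are synthetic, a one-variable least-squares line stands in for the GPR model, and the function names (`r_squared`, `fit_line`, `holdout_r2`) are hypothetical. Only the 70%/30% split and the idea of repeating it with different shuffles come from the abstract.

```python
import random
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def holdout_r2(xs, ys, seed, train_fraction=0.7):
    """Shuffle with the given seed, split, fit on train, score on test."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(indices))
    train, test = indices[:n_train], indices[n_train:]
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    predictions = [slope * xs[i] + intercept for i in test]
    return r_squared([ys[i] for i in test], predictions)


# Synthetic data (assumption): a noisy linear relationship.
data_rng = random.Random(0)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 * x + data_rng.gauss(0.0, 3.0) for x in xs]

# Same data, same model, 20 different shuffles.
scores = [holdout_r2(xs, ys, seed) for seed in range(20)]
print(f"test R2: min={min(scores):.3f} max={max(scores):.3f} "
      f"mean={statistics.fmean(scores):.3f}")
```

Reporting the range or mean over many shuffles, rather than the score from a single split, is one common way to make such comparisons less dependent on one particular assignment of data to the training and test sets.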
References:
[1] https://docs.google.com/spreadsheets/d/1LXwkZfhrdLtLuG7WvrZBgjZ2u5QsATsby9yN_FW6kcw/edit#gid=1
[2] Sci. Adv. 2020, 6, eaaz4301
[3] J. Comput. Aided Mol. Des. 31, 379–391
[4] Mol. Inf. 2020, 39, 1900107