(634e) Systematic Chem-Informatics and Machine Learning Studies for Gas Permeability and Selectivity in Polymers | AIChE

Authors 

Shi, W. - Presenter, National Energy Technology Laboratory, U.S. Department of Energy
Tiwari, S., Leidos Research Support Team
Steckel, J., National Energy Technology Laboratory
Sekizkardes, A., National Energy Technology Laboratory
Zhu, L., National Energy Technology Laboratory
Yi, S., Georgia Institute of Technology
Kusuma, V. A., Leidos Research Support Team
Resnik, K. P., Leidos Research Support Team - US DOE/NETL

Quickly and reliably predicting gas permeability and selectivity in polymers is important for developing effective and efficient polymers for specific gas separation applications, such as membranes for post-combustion carbon capture. Several challenges stand in the way: building a polymer database that covers the permeabilities of various gases, designing a scheme to encode polymer repeat units, generating fingerprints and descriptors that can distinguish different repeat unit structures, and building the machine learning (ML) models. We will address these problems in this presentation.

We extended the simplified molecular-input line-entry system (SMILES), originally developed for small molecules, to encode polymer repeat units. The [*] ghost atom, assigned different charges, specifies the head and tail atoms of the repeat units of both simple and ladder homopolymers. Multiple-mers can be generated from this extended SMILES scheme.

Starting from the Membrane Society of Australasia (MSA) polymer database, which contains permeabilities for different gases, we built an in-house polymer database by carefully checking the data and references for accuracy and adding SMILES for the repeat units of homopolymers, random polymers, and polymer blends. We added more than 100 new data sets, along with other polymer properties such as glass transition temperature, polymer density, and free volume fraction. Our in-house polymer database contains 1,674 sets of data, each corresponding to one polymer. Of these, 1,210 correspond to homopolymers, which contain only 807 unique polymer repeat units.

Five fingerprints (RDKfingerprint, MACCS keys, AtomPair fingerprint, Torsional fingerprint, and Morgan fingerprint) were tested along with 189 2D descriptors. Although the RDKfingerprint has been used by another research group [2] to develop ML models for predicting gas permeability in polymers, we found that it fails to distinguish 2.2% of the 807 unique repeat units: for some distinct repeat unit structures, the RDKfingerprint gives identical fingerprint values, which is unexpected. To fix this problem, we combined the RDKfingerprint with the 189 2D descriptors; the combination was able to distinguish the different polymer structures.
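
The collision check described above can be sketched in a few lines. This is only an illustration, not the authors' code: the structure names, bit vectors, and descriptor values below are hypothetical toy data, whereas in practice they would come from a cheminformatics toolkit. The "collision rate" computed here (the fraction of structures that share a fingerprint with at least one other structure) is one possible definition and may differ from how the 2.2% figure was obtained.

```python
from collections import defaultdict


def find_collisions(features):
    """Return groups of structure IDs that share an identical feature vector.

    `features` maps a structure ID (e.g. a repeat-unit label) to a sequence
    of numbers. Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for structure_id, vector in features.items():
        groups[tuple(vector)].append(structure_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def combine(fingerprints, descriptors):
    """Concatenate each fingerprint with the descriptors of the same structure."""
    return {
        sid: tuple(fingerprints[sid]) + tuple(descriptors[sid])
        for sid in fingerprints
    }


# Hypothetical toy data: four repeat units, two of which share a fingerprint.
fingerprints = {
    "unit_A": (1, 0, 1, 1),
    "unit_B": (1, 0, 1, 1),  # identical bits to unit_A
    "unit_C": (0, 1, 1, 0),
    "unit_D": (1, 1, 0, 0),
}
# Hypothetical descriptor values (two per structure instead of 189).
descriptors = {
    "unit_A": (72.1, 2.0),
    "unit_B": (86.1, 2.0),
    "unit_C": (100.2, 3.0),
    "unit_D": (58.1, 1.0),
}

fp_collisions = find_collisions(fingerprints)
combined_collisions = find_collisions(combine(fingerprints, descriptors))

# Fraction of structures involved in at least one fingerprint collision.
collision_rate = sum(len(group) for group in fp_collisions) / len(fingerprints)

print("fingerprint collisions:", fp_collisions)
print("collisions after adding descriptors:", combined_collisions)
print("collision rate:", collision_rate)
```

In this toy case the descriptors differ between the two colliding units, so the concatenated vectors are unique. Whether that holds for a real data set has to be checked the same way, by grouping on the combined vector.
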
In addition to comparing how well different fingerprints and descriptors describe polymer repeat units, we will show their ML performance using a Gaussian process regression (GPR) algorithm and different numbers of repeat units. Our work shows that the RDKfingerprint combined with the 189 2D descriptors exhibits the best ML performance at a polymer repeat unit length of 10. Furthermore, shuffling the data was found to significantly affect ML performance. For example, using the same 1,071 sets of data for CO2 permeability with a 70% training and 30% test split, shuffling the data (which changes the assignment to training and test sets) gives R2 values (coefficient of determination) for the test set between 0.84 and 0.91, although the R2 values for the training set are approximately 0.98. We will show strategies to alleviate this problem. We will also show our preliminary results of designing polymers by using the iQSPR method [3,4].
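
The sensitivity of the test R2 to the shuffle can be made visible by repeating the 70%/30% split with different random seeds and reporting the spread. The sketch below is a minimal illustration under stated assumptions: the data are synthetic, and a one-variable least-squares line stands in for the GPR model so that the example needs only the Python standard library. The helper names (`fit_line`, `split_r2`) are hypothetical, and this is not presented as the strategy the authors use.

```python
import random
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b (stand-in model, not GPR)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def split_r2(x, y, seed, train_fraction=0.7):
    """Shuffle with `seed`, fit on the training part, return the test R2."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_train = int(train_fraction * len(indices))
    train, test = indices[:n_train], indices[n_train:]
    slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
    predictions = [slope * x[i] + intercept for i in test]
    return r2_score([y[i] for i in test], predictions)


# Synthetic data (assumption): a noisy linear relationship, 200 points.
data_rng = random.Random(0)
x = [data_rng.uniform(0.0, 10.0) for _ in range(200)]
y = [2.0 * xi + 1.0 + data_rng.gauss(0.0, 3.0) for xi in x]

# Same data, same model, 20 different shuffles.
scores = [split_r2(x, y, seed) for seed in range(20)]
print(
    f"test R2 over {len(scores)} shuffles: "
    f"min={min(scores):.3f} max={max(scores):.3f} "
    f"mean={statistics.fmean(scores):.3f} stdev={statistics.stdev(scores):.3f}"
)
```

Reporting the minimum, maximum, and standard deviation across shuffles, rather than a single split, gives a more honest picture of how much a quoted test R2 depends on the particular assignment to training and test sets.
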

References:

  1. https://docs.google.com/spreadsheets/d/1LXwkZfhrdLtLuG7WvrZBgjZ2u5QsATsby9yN_FW6kcw/edit#gid=1
  2. Sci. Adv. 2020, 6, eaaz4301
  3. J Comput Aided Mol Des 31, 379–391
  4. Mol Inf 2020, 39, 1900107