(394j) Data Driven Machine Learning Based Inverse Polymer Membrane Design for CO2 separation | AIChE

Authors 

Wang, Z. G., California Institute of Technology
Designing polymer membranes with both high gas permeability and high selectivity is challenging because of the inherent trade-off between these two properties. This study introduces a data-driven, machine learning (ML)-based genetic algorithm for designing polymer membranes that separate CO2 from N2 and O2. Using literature data on the permeability of the three gases, we generate Simplified Molecular-Input Line-Entry System (SMILES) representations of polymers based on their repeating units. We then compare two fingerprinting techniques commonly used in the literature: hash-based fingerprints and substructure-key-based fingerprints. Our results indicate that hash-based fingerprints produce lower predictive errors on the test sets than substructure-key-based fingerprints. Subsequently, we develop multiple ML models with hash-based fingerprints to predict the permeabilities and selectivities of gases in polymer membranes, and find that regression models using tree-based ML algorithms are the most effective at predicting gas permeability. We then integrate this 'forward ML model', which predicts gas permeability and selectivity, with a generative algorithm for inverse design, allowing a more effective exploration of the polymer material space. Genetic algorithms (GAs) are used to iteratively improve solutions to the design problem, and they often navigate vast and intricate search spaces more efficiently than traditional optimization methods. The initial gene pool is formed using the 'Breaking of Retrosynthetically Interesting Chemical Substructures' (BRICS) algorithm, which yields 79 unique genes from 780 reference polymers. To start the GA, we create 100 parent polymers for the first generation, each with 4 genes in its monomer unit.
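The construction of the first generation described above can be sketched as follows. This is a minimal illustration only: the gene labels are placeholders standing in for BRICS fragments, and drawing genes uniformly at random with replacement is an assumption, since the abstract does not state how the 4 genes of each parent are chosen. The function names `make_gene_pool` and `make_initial_population` are hypothetical.

```python
import random

N_GENES = 79            # gene pool size reported in the abstract
N_PARENTS = 100         # number of first-generation parent polymers
GENES_PER_POLYMER = 4   # genes in each monomer unit


def make_gene_pool(n_genes=N_GENES):
    """Return placeholder gene labels (stand-ins for BRICS fragments)."""
    return [f"G{i:02d}" for i in range(n_genes)]


def make_initial_population(gene_pool, n_parents=N_PARENTS,
                            genes_per_polymer=GENES_PER_POLYMER, seed=0):
    """Build parents as ordered gene tuples.

    Assumption: genes are drawn uniformly at random, with replacement.
    """
    rng = random.Random(seed)
    return [
        tuple(rng.choice(gene_pool) for _ in range(genes_per_polymer))
        for _ in range(n_parents)
    ]


gene_pool = make_gene_pool()
population = make_initial_population(gene_pool)
```

In practice each gene would be a chemical fragment, and joining genes into a valid repeating unit would require a cheminformatics toolkit; the sketch only captures the bookkeeping of pool size, population size, and genes per polymer.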
Fifteen families with the smallest Tanimoto similarity scores are then constructed, each consisting of three parent polymers, and crossover operations are applied to vary the sequence of chemical building blocks, creating 12 offspring polymers per family. The viability of each offspring is assessed using a fitness function, which drives the evolutionary process of the GA. Running the GA for 100 generations produced more than 16,000 new polymer structures, among which promising candidates for CO2/N2 and CO2/O2 separations were identified. The top 100 polymers are predicted to have high glass transition temperatures, and approximately 20% of them contain pyridine functionalities. This data-driven ML approach to inverse polymer design is applicable to designing polymers for any scenario that requires constrained optimization. Future work will focus on graph representations combined with message-passing graph neural networks to represent polymer complexity more accurately.
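The family-scoring and crossover steps of the GA can be illustrated with a short sketch. Several details here are assumptions rather than the authors' method: fingerprints are represented as sets of on-bit indices, a family is scored by its mean pairwise Tanimoto similarity, and the 12 offspring are produced by single-point crossover between randomly chosen pairs of parents. The abstract does not specify the crossover scheme or the fitness function, so no fitness function is shown, and all function names are hypothetical.

```python
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def family_score(fingerprints):
    """Mean pairwise Tanimoto similarity of a family (lower means more diverse)."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def crossover(p1, p2, cut):
    """Single-point crossover: the head of p1 joined to the tail of p2."""
    return p1[:cut] + p2[cut:]


def make_offspring(family, n_offspring=12, seed=0):
    """Create offspring from a family of parent gene tuples.

    Assumption: each child comes from two distinct, randomly chosen
    parents and a random interior cut point.
    """
    rng = random.Random(seed)
    offspring = []
    for _ in range(n_offspring):
        p1, p2 = rng.sample(family, 2)
        cut = rng.randint(1, len(p1) - 1)
        offspring.append(crossover(p1, p2, cut))
    return offspring


# One family of three parents, each with 4 placeholder genes.
family = [("A", "B", "C", "D"), ("E", "F", "G", "H"), ("I", "J", "K", "L")]
children = make_offspring(family)
```

Candidate families with the lowest `family_score` would be the ones kept, so that crossover mixes dissimilar parents; each child would then be ranked by a fitness function built on the forward ML model's predictions.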