(420e) Active Learning for Data-Efficient Training of Machine Learning Models to Predict Adsorption in Metal-Organic Frameworks (MOFs).
AIChE Annual Meeting
2023
Computational Molecular Science and Engineering Forum
Automated Molecular and Materials Discovery: Integrating Machine Learning, Simulation, and Experiment II
Tuesday, November 7, 2023 - 4:30pm to 4:45pm
The challenge is that classical simulation methods for predicting adsorption (e.g., grand canonical Monte Carlo (GCMC)) are just "fast enough" to make thousands to hundreds of thousands of adsorption predictions in a reasonable timeframe. However, finding the optimal MOF-OC combination for each chemical separation of interest would probably entail trillions of adsorption predictions. Faster methods such as machine learning (ML) are therefore better poised to take on such a task. In earlier work, some of us demonstrated that multilayer perceptron (MLP) models can learn to predict adsorption at multiple conditions and for multiple molecules when provided with GCMC-generated training data for adsorption of different molecules at different pressures in different MOFs. That demonstration was limited to near-spherical, non-polar molecules, and extension to a wider class of molecules requires increasing the diversity and size of the training data. Because of the computational resources needed to generate training data and train the ML model, there is a critical need to keep the training dataset as small as possible.
Active learning (AL) can play a very important role in efficiently and "smartly" navigating the "adsorption space," limiting the data-generation burden while still enabling the training of highly predictive ML models. In this work, we first establish a Gaussian process regression (GPR) framework to model pure-component adsorption of nitrogen at 77 K from 10^-5 to 1 bar, methane at 298 K from 10^-5 to 100 bar, carbon dioxide at 298 K from 10^-5 to 100 bar, and hydrogen at 77 K from 10^-5 to 100 bar on eleven diverse sets of MOFs. In this GPR framework, a first model is trained on an initial dataset known as the "prior." Subsequent models are then retrained each time adsorption data are added to the dataset; which data to add is decided by the uncertainty of the GP model evaluated on a new dataset. Here, we tested three different "prior" selection schemes and recommend the best one for 44 adsorbate-adsorbent pairs. The recommendation is based primarily on the mean absolute error and the total number of data points required for the ML model's predictions to converge.
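The retraining loop described above (train on a prior, evaluate the GP uncertainty on candidate points, add the most uncertain point, retrain) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single isotherm in log10(pressure), a zero-mean GP with a fixed RBF kernel written directly in NumPy, and a toy Langmuir curve (the hypothetical `langmuir` function) standing in for a GCMC simulation. The function names, the kernel length scale, the stopping tolerance `std_tol`, and the labeling `budget` are all illustrative assumptions.

```python
import numpy as np


def langmuir(log10_p, q_max=1.0, log10_k=1.0):
    """Toy isotherm standing in for a GCMC run (hypothetical parameters)."""
    kp = 10.0 ** (log10_p + log10_k)
    return q_max * kp / (1.0 + kp)


def rbf_kernel(a, b, length=1.0):
    """Unit-variance squared-exponential kernel between 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_train, y_train, x_query, length=1.0, jitter=1e-6):
    """Zero-mean GP posterior mean and standard deviation at x_query."""
    k_tt = rbf_kernel(x_train, x_train, length) + jitter * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train, length)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_qt.T)
    var = np.clip(1.0 - np.sum(v ** 2, axis=0), 0.0, None)
    return k_qt @ alpha, np.sqrt(var)


def active_learning(pool, prior_idx, oracle=langmuir, std_tol=0.02, budget=40):
    """Grow the labeled set by always querying the most uncertain pool point."""
    labeled = list(prior_idx)  # the "prior": initial labeled points
    while True:
        x_lab = pool[labeled]
        y_lab = oracle(x_lab)  # in practice, one simulation per new point
        mean, std = gp_posterior(x_lab, y_lab, pool)
        worst = int(np.argmax(std))
        if std[worst] < std_tol or len(labeled) >= budget:
            break
        labeled.append(worst)
    # The toy oracle is cheap, so the MAE over the whole pool can be checked.
    mae = float(np.mean(np.abs(mean - oracle(pool))))
    return x_lab, y_lab, mae


pool = np.linspace(-5.0, 2.0, 71)  # log10(p / bar) from 1e-5 to 100 bar
x_lab, y_lab, mae = active_learning(pool, prior_idx=[0, 35, 70])
print(f"{len(x_lab)} of {len(pool)} pool points labeled, MAE = {mae:.4f}")
```

In this sketch the prior is simply the two pressure end points plus the midpoint; comparing different choices of `prior_idx` by final MAE and by the number of labeled points is one way to mimic the comparison of prior selection schemes described above.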
Upon establishing the GPR framework, we demonstrated the application of the methodology to include alchemical molecules. These hypothetical species can be characterized by two main features: intermolecular potential parameters (e.g., well-depth and the distance at which the intermolecular potential between two particles is zero); and intra-molecular properties, such as bond length and charges.
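To make the two feature groups concrete, the sketch below represents an alchemical species as a small data structure and evaluates a pair potential from its well depth and zero-crossing distance. The abstract does not name the functional form; a 12-6 Lennard-Jones potential is assumed here because its two parameters match the description. The class names, field names, units, and numerical values are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AlchemicalSite:
    epsilon: float       # well depth (assumed units: K, i.e. epsilon / k_B)
    sigma: float         # distance at which the pair potential is zero (assumed: angstrom)
    charge: float = 0.0  # partial charge (assumed units: e)


@dataclass(frozen=True)
class AlchemicalSpecies:
    sites: Tuple[AlchemicalSite, ...]
    bond_length: float = 0.0  # intramolecular site-site distance; 0 for single-site species


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy at separation r (assumed functional form)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Illustrative two-site species; the values are not taken from the abstract.
site = AlchemicalSite(epsilon=100.0, sigma=3.5, charge=0.0)
species = AlchemicalSpecies(sites=(site, site), bond_length=1.1)

r_min = 2.0 ** (1.0 / 6.0) * site.sigma  # location of the potential minimum
print(lennard_jones(site.sigma, site.epsilon, site.sigma))  # zero crossing
print(lennard_jones(r_min, site.epsilon, site.sigma))       # approximately -epsilon
```

Sweeping `epsilon`, `sigma`, `charge`, and `bond_length` over a grid is one simple way to enumerate hypothetical species of this kind.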
A previously developed MLP model was trained on approximately 5 million GCMC data points obtained from 1800 topologically and chemically diverse ToBaCCo-generated MOFs, using several single- and multiple-site alchemical species at different fugacities; it has advanced adsorption studies by providing accurate results for a diverse set of real molecules. Using the established AL framework, with the developed MLP model as a substitute for GCMC, we show that we can build accurate GPR models that predict the isotherms of all alchemical species across these 1800 diverse MOFs, evaluated on a separate test dataset that includes the fugacity and alchemical parameters. Our results show that we saved 57.5% of the data, indicating that only around 2.2 million simulations are needed to train a new MLP model for adsorption.