(95g) A Compressed Sensing Framework for Learning Interpretable Molecular Property Models from Limited Data: Application to Discovery of Sustainable Battery Materials | AIChE

Authors 

Muthyala, M., The Ohio State University
Paulson, J., The Ohio State University
Molecular property models are of great interest and importance to science and technology development. Predictive models reduce the need for high-throughput DFT screening and/or experimentation in materials search problems. More recently, interest has shifted from traditional first-principles models to a black-box, data-driven modeling paradigm. Many model structures, such as neural networks [1,2], random forests [3,4], Gaussian processes [5], and support vector machines [6], to name a few, have been investigated in the context of modeling molecular properties. On the one hand, these models eliminate the need for domain expertise and Edisonian approaches to fitting models to data, and they are often computationally cheaper to evaluate (an important attribute when large molecular classes must be screened for candidate molecules). On the other hand, they typically do not extrapolate, provide little-to-no insight into the underlying physics, and require vast amounts of data. Of particular importance is their inability to generalize, which, when coupled with the need for large data sets, calls into question whether these data-driven methods really are more efficient.

We distinguish the traditional black-box data-driven modeling paradigm from methods designed with human interpretability in mind, which we refer to as symbolic and interpretable models (SIMs). SIMs can define a property as an algebraic model that depicts contributions and competitions across various physically meaningful quantities, similar to dimensionless numbers. Such methods focus on building low-dimensional terms from a high-dimensional feature space and an operator set by finding feature and operator combinations that yield the highest correlation to observations. Training such models typically requires combinatorial screening to identify the terms with the highest correlation, which results in a computationally challenging problem. Although learning SIMs can be challenging, the resulting models have several properties that justify the approach. In particular, when the feature space consists of physically meaningful descriptors, the resulting model can describe the underlying physics, which yields some of the following benefits: (i) learning a model requires fewer data points relative to purely data-driven models; (ii) the learned models can provide physical insights into the investigated phenomena; (iii) the learned models generalize better than black-box models; and (iv) the learned models provide an efficient "latent representation" for use in non-parametric modeling/optimization frameworks.
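
The term-building and correlation-screening idea described above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of the screen-then-sparsify method cited as [7], not the implementation used in this work. The operator set (square, product, ratio), the synthetic data, and all function names (`build_candidates`, `sis`, `fit`, `sisso_sketch`) are assumptions made for the example.

```python
import itertools
import numpy as np

def build_candidates(X, names):
    """Expand primary features with a small, illustrative operator set."""
    cols, labels = [], []
    for j, n in enumerate(names):
        cols += [X[:, j], X[:, j] ** 2]
        labels += [n, f"({n})^2"]
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        # Division assumes strictly nonzero features (true for the demo data).
        cols += [X[:, i] * X[:, j], X[:, i] / X[:, j]]
        labels += [f"({names[i]}*{names[j]})", f"({names[i]}/{names[j]})"]
    return np.column_stack(cols), labels

def sis(F, target, k, exclude=()):
    """Screening step: indices of the k not-yet-selected candidates with the
    largest absolute Pearson correlation to `target`."""
    Fc = F - F.mean(axis=0)
    tc = target - target.mean()
    denom = np.linalg.norm(Fc, axis=0) * np.linalg.norm(tc)
    corr = np.zeros(F.shape[1])
    ok = denom > 0
    corr[ok] = np.abs(Fc[:, ok].T @ tc) / denom[ok]
    skip = set(exclude)
    order = np.argsort(corr)[::-1]
    return [int(i) for i in order if int(i) not in skip][:k]

def fit(F, y, subset):
    """Least-squares fit with an intercept; returns (rmse, coef, residual)."""
    A = np.column_stack([F[:, list(subset)], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(np.sqrt(np.mean(resid ** 2))), coef, resid

def sisso_sketch(F, y, max_dim=2, k=5):
    """For each dimension: screen against the current residual, then search
    all subsets of that size in the accumulated pool (sparsifying step)."""
    pool, resid, models = [], y, []
    for dim in range(1, max_dim + 1):
        pool += sis(F, resid, k, exclude=pool)
        rmse, coef, resid, subset = min(
            ((*fit(F, y, s), s) for s in itertools.combinations(pool, dim)),
            key=lambda m: m[0],
        )
        models.append((rmse, subset, coef))
    return models

# Synthetic demo data (not from this work): 40 samples, 4 positive features.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(40, 4))
names = ["x0", "x1", "x2", "x3"]
y = 3.0 * X[:, 0] / X[:, 1] - 2.0 * X[:, 2] ** 2 + 1.0

F, labels = build_candidates(X, names)
models = sisso_sketch(F, y, max_dim=2, k=5)
for rmse, subset, coef in models:
    print([labels[i] for i in subset], round(rmse, 4))
```

The exhaustive subset search is what makes the problem combinatorial: the pool size `k` and the model dimension together set the number of least-squares fits, so the screening step exists to keep that pool small.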

This work demonstrates a systematic framework for the data-driven learning of SIMs. Specifically, we use the sure independence screening and sparsifying operator (SISSO) [7] to identify property descriptors from high-dimensional Quantitative Structure-Property Relationship (QSPR) feature vectors [8]. Built upon many years of scientific research, QSPR descriptors represent a collection of molecular structural properties that have been deemed pertinent in modeling other chemical properties. Thus, they provide the physically meaningful molecular features necessary to learn novel molecular property equations. First, we show that with just 115 molecules, we can learn an accurate model of reduction potential that generalizes beyond the class of molecules used in training and accurately predicts the reduction potential of over 100,000 held-out molecules. Second, we build a solubility model and predict both the redox potentials and solubilities of over 600,000 organic molecules. This large set of molecular predictions is used to identify a Pareto front over the two battery-relevant properties, from which we identify several synthesizable molecules to test as novel organic electrode materials. Initial testing of these novel organic electrode batteries shows energy densities and cycle-based degradation rates comparable to those of current state-of-the-art organic electrode batteries.
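
The Pareto-front step mentioned above can be illustrated with a short Python sketch. It assumes both predicted properties are to be maximized, which is a simplification, since the preferred direction for redox potential depends on the electrode role. The function name `pareto_front` and the five data points are hypothetical and are not results from this work.

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated rows when both columns are to be maximized.

    Sorts by the first objective (descending) and sweeps once, keeping a row
    only if its second objective strictly exceeds every value seen so far.
    Exact duplicates of a kept row are dropped.
    """
    pts = np.asarray(points, dtype=float)
    # lexsort uses the last key as primary: column 0 descending, ties broken
    # by column 1 descending.
    order = np.lexsort((-pts[:, 1], -pts[:, 0]))
    front, best_second = [], -np.inf
    for i in order:
        if pts[i, 1] > best_second:
            front.append(int(i))
            best_second = pts[i, 1]
    return front

# Hypothetical (property_a, property_b) predictions for five molecules.
predictions = [
    [1.0, 5.0],
    [2.0, 4.0],
    [3.0, 1.0],
    [2.0, 2.0],
    [0.5, 4.5],
]
front = pareto_front(predictions)
print(front)  # indices of the non-dominated molecules
```

Because the sweep needs only one sort, the cost is O(n log n), which stays practical for prediction sets with hundreds of thousands of molecules.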

References:

[1] Vermeire, F. H.; Chung, Y.; Green, W. H. Predicting Solubility Limits of Organic Solutes for a Wide Range of Solvents and Temperatures. J. Am. Chem. Soc. 2022, 10785–10797.

[2] Jin, W.; Barzilay, R.; Jaakkola, T. Junction Tree Variational Autoencoder for Molecular Graph Generation. 2019.

[3] Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Computer Sciences 2003, 43, 1947–1958, PMID: 14632445.

[4] Chen, C.-H.; Tanaka, K.; Funatsu, K. Random Forest Model with Combined Features: A Practical Approach to Predict Liquid-crystalline Property. Molecular Informatics 2019, 38, 1800095.

[5] Deringer, V. L.; Bartók, A. P.; Bernstein, N.; Wilkins, D. M.; Ceriotti, M.; Csányi, G. Gaussian Process Regression for Materials and Molecules. Chemical Reviews 2021, 121, 10073–10141, PMID: 34398616.

[6] Jorissen, R. N.; Gilson, M. K. Virtual Screening of Molecular Databases Using a Support Vector Machine. Journal of Chemical Information and Modeling 2005, 45, 549–561, PMID: 15921445.

[7] Ouyang, R.; Curtarolo, S.; Ahmetcik, E.; Scheffler, M.; Ghiringhelli, L. M. SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys. Rev. Mater. 2018, 2, 083802.

[8] Moriwaki, H.; Tian, Y.; Kawashita, N. Mordred: a molecular descriptor calculator. J Cheminform 2018, 10.