(15h) High-Dimensional Bayesian Optimization of Molecular Properties Using Quantitative Structure-Property Relationships on Sparse Axis-Aligned Subspaces

Authors 

Banker, T., The Ohio State University
Paulson, J., The Ohio State University

The development of next-generation technologies, such as biomaterials, novel catalysts, solar panels, and batteries, hinges on our ability to efficiently identify or discover molecules that exhibit more desirable properties for a given application. Identifying which molecules exhibit the best properties is a notoriously challenging problem, since approximately 10^100 unique materials can be constructed from the periodic table [1], which is far beyond the scale of existing synthesis and testing capabilities. Traditional material exploration methods have thus relied on Edisonian approaches, i.e., heuristic search strategies constructed from a practitioner's observations and expertise.

This reliance on ad-hoc approaches to materials optimization can be partially attributed to the non-numerical nature of molecules, which prevents traditional optimization methods from being applied directly. However, even with a well-defined mapping of molecules to a numeric space, or formally, a latent space [2], one cannot apply standard first-order methods (such as gradient descent) to optimize molecular properties, since the functional relationship between the latent molecular inputs and the desired properties is unknown. As such, a zeroth-order (derivative-free) optimization approach is needed to search for optimal molecules in the latent space; however, these methods are known to have difficulty scaling to high-dimensional representations [3]. The challenge in molecular property optimization is therefore twofold: (i) How should we define effective molecular representations that are suitable for optimization? (ii) How can we deal with the unknown (black-box) relationship between molecular features and desired properties in potentially high-dimensional representation spaces?

Although there has been significant interest in using machine learning to optimize molecular properties, the majority of works fall into one of two categories, each with important limitations. The first class of methods takes a "model and then optimize" view, wherein supervised learning is used to construct a predictive model that maps molecular features to properties, and that model is then optimized to identify potentially promising candidate materials [4-6]. Such a strategy can be effective whenever three key assumptions are met: (i) the selected "feature space" is general enough to provide a unique/holistic description of the considered set of molecules; (ii) the type of model (e.g., neural network) is general enough to capture the relationship between these features and properties; and (iii) enough data is available to train the potentially large number of parameters appearing in the model. In many important applications, however, at least one of these assumptions is violated, which creates serious problems in practice; the biggest is that the trained model is likely to produce inaccurate predictions, with all predicted "high-performance" molecules suffering from the same bias. The second class of methods directly treats the optimization problem as a learning problem using concepts from Bayesian optimization (BO) [7-9]. These methods have the potential to be much more data-efficient than those in the first class since they explicitly address the tradeoff between exploration (learning for the future) and exploitation (immediately advancing toward the goal). However, BO relies on the ability to construct a probabilistic surrogate model of the unknown property function, which is challenging for high-dimensional molecular feature representations. Recent work has focused on the use of unsupervised learning techniques, such as variational autoencoders [9], to construct a low-dimensional latent space on which BO can be performed directly. Since this latent space is constructed completely offline (independently of the property of interest), there is no guarantee that the latent variables are sufficiently well-correlated with the property to satisfy the smoothness assumptions inherent to BO.
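
For concreteness, the sketch below shows the second class of methods in its standard form: a BO loop with a Gaussian process surrogate over a fixed, precomputed latent (or feature) representation of a candidate pool. This is only an illustration of the baseline, not the proposed algorithm; it assumes BoTorch's SingleTaskGP and analytic ExpectedImprovement, and the latent matrix Z and the property oracle are placeholders.

import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms import Standardize
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from gpytorch.mlls import ExactMarginalLogLikelihood

def bo_over_latent_pool(Z, evaluate_property, n_init=5, n_iter=20):
    """Discrete BO over a fixed candidate pool Z (n_candidates x d, torch.double)."""
    idx = torch.randperm(Z.shape[0])[:n_init]              # random initial design
    X = Z[idx]
    y = torch.tensor([[evaluate_property(z)] for z in X]).to(Z)
    for _ in range(n_iter):
        gp = SingleTaskGP(X, y, outcome_transform=Standardize(m=1))  # GP surrogate of the property
        fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
        acq = ExpectedImprovement(gp, best_f=y.max())      # balances exploration and exploitation
        scores = acq(Z.unsqueeze(1))                       # score every candidate (q = 1 batches)
        x_next = Z[scores.argmax()].unsqueeze(0)           # query the most promising candidate
        y_next = torch.tensor([[evaluate_property(x_next[0])]]).to(Z)
        X, y = torch.cat([X, x_next]), torch.cat([y, y_next])
    return X, y

Note that the representation Z is fixed before any property data are collected, which is precisely the assumption relaxed by the approach described next.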

In this work, we propose a novel active learning method for efficient material property optimization in high-dimensional molecular feature spaces. The proposed method builds upon the BO framework; however, it involves two important modifications that address the issues identified above with standard dimensionality reduction techniques. First, we take advantage of information-rich Quantitative Structure-Property Relationship (QSPR) descriptors [10] to represent molecules, as opposed to commonly used alternatives such as SMILES strings [11], graph encodings [12], or molecular fingerprinting [13] methods. QSPR descriptors are computed from translationally and rotationally invariant geometric functions of a molecule's structure and/or atomic properties, and include features ranging from simple (e.g., atom masses) to complex (e.g., hydrogen bond donors/acceptors) quantities. Specifically, we use the Mordred [14] cheminformatics toolbox to produce more than 1800 numerical features for each molecule. Second, instead of reducing the dimensionality of this space offline using unsupervised learning methods, we replace the traditional Gaussian process (GP) surrogate model with a GP model defined on sparse axis-aligned subspaces (SAAS) [15]. The main advantage of SAAS-GP models is their ability to identify, on the fly, the dimensions relevant for optimization as the BO algorithm progresses. Therefore, instead of selecting a latent space a priori, we iteratively construct one as new data are collected. The proposed methodology consistently outperforms state-of-the-art algorithms (e.g., [16]) in terms of sampling and computational efficiency on a variety of benchmark molecular property optimization problems.
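
As a rough sketch of how one iteration of such a pipeline could be assembled from off-the-shelf components (not a definitive implementation of our method), the example below computes Mordred descriptors for a small placeholder pool of SMILES strings, fits BoTorch's SAAS-GP (SaasFullyBayesianSingleTaskGP) with NUTS, and then scores the candidates with expected improvement; the molecules and property values are illustrative only.

import torch
from rdkit import Chem
from mordred import Calculator, descriptors
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.fit import fit_fully_bayesian_model_nuts
from botorch.acquisition import qExpectedImprovement

# 1) Featurize a (placeholder) candidate pool with Mordred QSPR descriptors.
smiles_pool = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"]    # illustrative molecules only
mols = [Chem.MolFromSmiles(s) for s in smiles_pool]
desc = Calculator(descriptors, ignore_3D=True).pandas(mols)    # on the order of 1800 columns
X_pool = torch.tensor(desc.select_dtypes("number").fillna(0.0).values, dtype=torch.double)
lb, ub = X_pool.min(0).values, X_pool.max(0).values
X_pool = (X_pool - lb) / (ub - lb).clamp_min(1e-9)             # scale descriptors to the unit cube

# 2) Fit a SAAS-GP surrogate on the molecules evaluated so far (placeholder property values).
X_train = X_pool[:3]
y_train = torch.tensor([[0.2], [0.9], [0.4]], dtype=torch.double)
saas_gp = SaasFullyBayesianSingleTaskGP(X_train, y_train)
fit_fully_bayesian_model_nuts(saas_gp, warmup_steps=256, num_samples=128, thinning=16)

# 3) Score all candidates with expected improvement and pick the next molecule to query.
acq = qExpectedImprovement(saas_gp, best_f=y_train.max())
scores = acq(X_pool.unsqueeze(1))                              # one q = 1 batch per candidate
print("next molecule to evaluate:", smiles_pool[int(scores.argmax())])

The SAAS prior places strong shrinkage on the inverse length-scales, so only the few descriptor dimensions supported by the data effectively enter the surrogate, which is what allows the BO loop to operate directly in the full descriptor space rather than in an offline-learned latent space.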

References:

[1] Le, T. C., and D. A. Winkler. "Discovery and optimization of materials using evolutionary approaches." Chemical Reviews 116 (2016): 6107-6132.

[2] Bjerrum, Esben Jannik, and Boris Sattarov. "Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders." Biomolecules 8.4 (2018): 131.

[3] Frazier, Peter I. "A tutorial on Bayesian optimization." arXiv preprint arXiv:1807.02811 (2018).

[4] Zhu, Ming-Xiao, et al. "Machine-learning-driven discovery of polymers molecular structures with high thermal conductivity." International Journal of Heat and Mass Transfer 162 (2020): 120381.

[5] Meftahi, Nastaran, et al. "Machine learning property prediction for organic photovoltaic devices." npj Computational Materials 6.1 (2020): 166.

[6] Dobbelaere, Maarten R., et al. "Machine learning for physicochemical property prediction of complex hydrocarbon mixtures." Industrial & Engineering Chemistry Research 61.24 (2022): 8581-8594.

[7] Wang, Ke, and Alexander W. Dowling. "Bayesian optimization for chemical products and functional materials." Current Opinion in Chemical Engineering 36 (2022): 100728.

[8] Zhang, Yichi, Daniel W. Apley, and Wei Chen. "Bayesian optimization for materials design with mixed quantitative and qualitative variables." Scientific reports 10.1 (2020): 1-13.

[9] Lei, Bowen, et al. "Bayesian optimization with adaptive surrogate models for automated experimental design." npj Computational Materials 7.1 (2021): 194.

[10] Le, Tu, et al. "Quantitative structure–property relationship modeling of diverse materials properties." Chemical reviews 112.5 (2012): 2889-2919.

[11] O’Boyle, Noel M. "Towards a Universal SMILES representation-A standard method to generate canonical SMILES based on the InChI." Journal of cheminformatics 4 (2012): 1-14.

[12] Wieder, Oliver, et al. "A compact review of molecular property prediction with graph neural networks." Drug Discovery Today: Technologies 37 (2020): 1-12.

[13] Diddams, Scott A., Leo Hollberg, and Vela Mbele. "Molecular fingerprinting with the resolved modes of a femtosecond laser frequency comb." Nature 445.7128 (2007): 627-630.

[14] Moriwaki, H., Y. Tian, and N. Kawashita. "Mordred: a molecular descriptor calculator." Journal of Cheminformatics 10 (2018).

[15] Eriksson, David, and Martin Jankowiak. "High-dimensional Bayesian optimization with sparse axis-aligned subspaces." Uncertainty in Artificial Intelligence. PMLR, 2021.

[16] Deshwal, Aryan, and Jana Doppa. "Combining latent space and structured kernels for Bayesian optimization over combinatorial spaces." Advances in Neural Information Processing Systems 34 (2021): 8185-8200.