(119d) The Molecular Descriptors with Actively Identified Subspaces (MolDAIS) Framework for Efficient Molecular Property Optimization and Beyond | AIChE

Authors 

Muthyala, M., The Ohio State University
Chen, T. Y., The Ohio State University
Paulson, J., The Ohio State University
Molecular property optimization (MPO) is the process of systematically selecting molecules with improved structural and/or functional properties to achieve a desired set of objectives, and it represents a critical step in many materials and drug discovery applications. A key challenge in MPO is selecting a useful numerical representation of molecules that enables systematic optimization in representation space. Many representations have been proposed, including SMILES and SELFIES strings [1, 2], molecular graphs [3], molecular fingerprints [4], and molecular descriptors [5]. An important limitation shared by all of these is that they are high-dimensional, which greatly complicates the optimization process. This limitation has motivated methods that instead optimize over a continuous latent space learned in an unsupervised fashion from a large set of molecules. For example, one can first train a deep variational autoencoder and then apply, e.g., Bayesian optimization in the latent space to generate promising latent codes, which are decoded into molecules on which the objective function can be evaluated [6, 7]. This approach is computationally demanding (training the encoder-decoder often requires a significant amount of compute) and may also require expert intuition to design the neural network architecture. Even more importantly, because the encoder-decoder is trained without property data, nothing drives the mapping from latent codes to property values to be smooth or continuous. Such properties are crucial for sample-efficient algorithms such as Bayesian optimization [8], which leverage smoothness to gracefully trade off exploration and exploitation of the search/design space.

In this work, we present the MolDAIS (Molecular Descriptors with Actively Identified Subspaces) framework [9], which tackles MPO problems from a different perspective. Specifically, MolDAIS combines a high-dimensional molecular descriptor representation of the search space with a Gaussian process (GP) model equipped with a sparse axis-aligned subspace (SAAS) prior [10], on top of which standard Bayesian optimization methods (such as expected improvement) can be deployed. The fundamental assumption made by MolDAIS is that only a fairly small number of well-designed features are needed to accurately predict any specific property of interest. Since these key features are rarely known a priori, we aim to learn them (from a large set of features) in the low-data regime using a sparsity-inducing GP prior. This idea is similar to the one motivating so-called explainable machine learning methods, which have grown in popularity in recent years [11, 12]. An important difference is that, in MolDAIS, the chosen descriptors are actively learned in the sense that they evolve as more property data are collected; we have observed that this ability to correct errors in the representation on the fly can have a large influence on performance. MolDAIS has been extensively benchmarked against several competing MPO algorithms; we have found that it routinely outperforms all other tested approaches and, in some cases, finds the best-in-class molecule among more than 100,000 candidates after testing only ~20 molecules. We will also show how MolDAIS can be straightforwardly extended to problems outside of traditional MPO, including constrained, multi-objective, and human-in-the-loop settings.
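The loop described above (refit a sparse GP over descriptors, then pick the next molecule by expected improvement) can be illustrated with a minimal NumPy sketch. It is not the MolDAIS implementation: the descriptor pool and the property function are synthetic, all function names are hypothetical, and a crude greedy selection of kernel weights stands in for the fully Bayesian inference normally used with the SAAS prior [10]. The noise level, weight grid, and penalty are arbitrary illustrative choices.

```python
import numpy as np
from math import erf, sqrt, pi

def ard_kernel(A, B, rho):
    # k(a, b) = exp(-0.5 * sum_d rho_d * (a_d - b_d)^2); rho_d = 0 switches descriptor d off.
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.einsum("ijd,d->ij", diff ** 2, rho))

def log_marginal(X, y, rho, noise=1e-2):
    # GP log marginal likelihood with unit signal variance and fixed noise (assumed values).
    K = ard_kernel(X, X, rho) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def select_subspace(X, y, penalty=1.0, grid=(0.03, 0.1, 0.3, 1.0), max_active=5):
    # Greedy stand-in for sparse inference: activate one descriptor at a time and stop
    # when the best addition improves the log marginal likelihood by less than `penalty`.
    rho = np.zeros(X.shape[1])
    best = log_marginal(X, y, rho)
    for _ in range(max_active):
        trials = []
        for d in np.flatnonzero(rho == 0):
            for r in grid:
                cand = rho.copy()
                cand[d] = r
                trials.append((log_marginal(X, y, cand), d, r))
        if not trials:
            break
        score, d, r = max(trials)
        if score - best < penalty:
            break
        rho[d], best = r, score
    return rho

def posterior(Xtr, ytr, Xte, rho, noise=1e-2):
    # Standard zero-mean GP posterior mean and standard deviation at the test points.
    K = ard_kernel(Xtr, Xtr, rho) + noise * np.eye(len(Xtr))
    Ks = ard_kernel(Xte, Xtr, rho)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    v = np.linalg.solve(L, Ks.T)
    var = np.clip(1.0 - (v ** 2).sum(axis=0), 1e-12, None)
    return Ks @ alpha, np.sqrt(var)

def expected_improvement(mean, std, best):
    # Closed-form expected improvement for maximization.
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)
    return (mean - best) * cdf + std * pdf

def run_bo(X_pool, f, n_init=5, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    idx = [int(i) for i in rng.choice(len(X_pool), size=n_init, replace=False)]
    y = [f(X_pool[i]) for i in idx]
    rho = np.zeros(X_pool.shape[1])
    for _ in range(n_iter):
        y_arr = np.array(y)
        y_std = (y_arr - y_arr.mean()) / (y_arr.std() + 1e-9)
        rho = select_subspace(X_pool[idx], y_std)   # subspace is re-identified every round
        mean, std = posterior(X_pool[idx], y_std, X_pool, rho)
        ei = expected_improvement(mean, std, y_std.max())
        ei[idx] = -np.inf                            # never re-test an evaluated candidate
        nxt = int(np.argmax(ei))
        idx.append(nxt)
        y.append(f(X_pool[nxt]))
    return idx, np.array(y), rho

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(300, 20))                  # hypothetical standardized descriptors
toy_property = lambda x: -(x[0] - 1.0) ** 2 - (x[3] + 0.5) ** 2  # depends on 2 of 20 features
idx, y, rho = run_bo(X_pool, toy_property)
print("active descriptors:", np.flatnonzero(rho))
print("best value found:", y.max(), "| pool optimum:", max(toy_property(x) for x in X_pool))
```

In practice the rows of `X_pool` would be computed descriptors (for example from a calculator such as the one in [5]) rather than random numbers, and the kernel weights would be inferred under the SAAS prior instead of chosen greedily. The sketch only shows how the active subspace can change between iterations as new property values arrive.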

References:

[1] O’Boyle, N. M. (2012). Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI. Journal of Cheminformatics, 4, 1-14.

[2] Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), 045024.

[3] Wieder, O., Kohlbacher, S., Kuenemann, M., Garon, A., Ducrot, P., Seidel, T., & Langer, T. (2020). A compact review of molecular property prediction with graph neural networks. Drug Discovery Today: Technologies, 37, 1-12.

[4] Cereto-Massagué, A., Ojeda, M. J., Valls, C., Mulero, M., Garcia-Vallvé, S., & Pujadas, G. (2015). Molecular fingerprint similarity search in virtual screening. Methods, 71, 58-63.

[5] Moriwaki, H., Tian, Y. S., Kawashita, N., & Takagi, T. (2018). Mordred: a molecular descriptor calculator. Journal of Cheminformatics, 10, 1-14.

[6] Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., & Aspuru-Guzik, A. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2), 268-276.

[7] Maus, N., Jones, H., Moore, J., Kusner, M. J., Bradshaw, J., & Gardner, J. (2022). Local latent space Bayesian optimization over structured inputs. Advances in Neural Information Processing Systems, 35, 34505-34518.

[8] Frazier, P. I. (2018). A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.

[9] Sorourifar, F., Banker, T., & Paulson, J. A. (2024). Accelerating Black-Box Molecular Property Optimization by Adaptively Learning Sparse Subspaces. arXiv preprint arXiv:2401.01398.

[10] Eriksson, D., & Jankowiak, M. (2021, December). High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Uncertainty in Artificial Intelligence (pp. 493-503). PMLR.

[11] Wang, Y., Wagner, N., & Rondinelli, J. M. (2019). Symbolic regression in materials science. MRS Communications, 9(3), 793-805.

[12] Udrescu, S. M., & Tegmark, M. (2020). AI Feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16), eaay2631.