(197au) Multi-Fidelity Deep Learning for Data-Efficient Molecular Property Models from Experimental and Computational Data | AIChE

Authors 

Greenman, K. P. - Presenter, Massachusetts Institute of Technology
Gomez-Bombarelli, R., Massachusetts Institute of Technology
Green, W., Massachusetts Institute of Technology
Orkhon, T., Massachusetts Institute of Technology
Chemistry and materials design frequently involve data of differing fidelity with respect to the true property being optimized, for instance experimental measurements and computational approximations, or simulations at different levels of theory. The fidelity levels typically involve a cost-accuracy trade-off: high-fidelity results are relatively expensive or slow to acquire, while low-fidelity results are cheaper and faster but contain more bias and/or noise. Low-fidelity data is often useful for covering a broader portion of chemical space in the tiered screens (computational funnels) of high-throughput virtual screening campaigns. More recently, machine learning has been used to connect low- to high-fidelity data through sequential approaches such as transfer learning [1] and Δ-machine learning [2]. Multi-fidelity methods offer some advantages for improving the generalizability of high-fidelity predictions. They are less expensive than Δ-machine learning because they also learn the low-fidelity function, which avoids extra cost at inference, and, unlike transfer learning, they leverage the two data fidelities at the same time [3]. However, many questions about when multi-fidelity methods should be expected to perform well remain unanswered, such as how the dataset sizes needed at the two fidelities relate to each other and how the accuracy mismatch between fidelities influences the results.
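
To make the inference-cost point about Δ-machine learning concrete, here is a minimal sketch under stated assumptions. The functions `low_fidelity` and `high_fidelity` are hypothetical toy stand-ins (not molecular properties and not the models used in this work), and the correction is fitted with ordinary least squares rather than a neural network. The sketch only illustrates that a Δ-model predicts the difference between fidelities, so every new prediction still requires a low-fidelity evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_fidelity(x):
    # Hypothetical cheap approximation (stands in for a calculation).
    return np.sin(x)

def high_fidelity(x):
    # Hypothetical expensive reference: the low-fidelity value plus a
    # smooth systematic offset (an assumption made for illustration).
    return np.sin(x) + 0.3 * x + 0.5

# Small high-fidelity training set.
x_train = rng.uniform(0.0, 3.0, size=20)
delta_train = high_fidelity(x_train) - low_fidelity(x_train)

# Fit the correction delta(x) ~ a * x + b by least squares.
A = np.column_stack([x_train, np.ones_like(x_train)])
coef, *_ = np.linalg.lstsq(A, delta_train, rcond=None)

def predict_delta_ml(x):
    # Inference needs a fresh low-fidelity evaluation for every input.
    return low_fidelity(x) + coef[0] * x + coef[1]

x_test = np.linspace(0.0, 3.0, 50)
max_err = np.max(np.abs(predict_delta_ml(x_test) - high_fidelity(x_test)))
```

In this toy setting the correction is exactly linear, so the fit recovers it almost perfectly; real fidelity gaps are rarely this simple. A multi-fidelity model would instead learn both functions from the inputs, so `low_fidelity` would not need to be called at prediction time.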

To address these questions, we present a comprehensive benchmark of multi-fidelity methods. We systematically add noise and bias to a synthetic dataset and split high- and low-fidelity data in ways that mimic realistic use cases. We also evaluate the multi-fidelity methods on several real-world datasets of optical properties (experiments and time-dependent density functional theory calculations), solubility (experiments and COSMO-RS calculations), and drug efficacy/potency (single-dose and dose-response measurements). We compare the multi-fidelity model performance to transfer learning and Δ-machine learning and provide recommendations for best practices in training models when multiple levels of fidelity are available. Finally, we demonstrate the application of uncertainty quantification and active learning in these models. The more thorough understanding of multi-fidelity methods we develop in this work will allow for more data-efficient molecular and materials design.
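
The synthetic corruption step described above might be sketched as follows. The abstract does not specify the functional form of the noise and bias or the split ratios, so everything here is an assumption for illustration: the helper name `make_low_fidelity`, a bias proportional to the true value, Gaussian noise, and a 10% high-fidelity subset.

```python
import numpy as np

def make_low_fidelity(y_true, bias_scale, noise_std, rng):
    # Assumed corruption model: a systematic bias proportional to the
    # true value plus zero-mean Gaussian noise.
    bias = bias_scale * y_true
    noise = rng.normal(0.0, noise_std, size=y_true.shape)
    return y_true + bias + noise

rng = np.random.default_rng(42)

# Hypothetical "true" property values for 1000 molecules.
n_molecules = 1000
y_true = rng.normal(size=n_molecules)

# Low-fidelity labels are available for every molecule.
y_low = make_low_fidelity(y_true, bias_scale=0.2, noise_std=0.1, rng=rng)

# High-fidelity labels are kept for only a small random subset (10% here).
hf_mask = np.zeros(n_molecules, dtype=bool)
hf_mask[rng.choice(n_molecules, size=100, replace=False)] = True
y_high = np.where(hf_mask, y_true, np.nan)  # NaN marks a missing label
```

Sweeping `bias_scale`, `noise_std`, and the size of the high-fidelity subset is one way to probe how the accuracy mismatch and the data balance between fidelities affect a model.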

References:

[1]: Vermeire, Florence H., and William H. Green. "Transfer learning for solvation free energies: From quantum chemistry to experiments." Chemical Engineering Journal 418 (2021): 129307.

[2]: Ramakrishnan, Raghunathan, et al. "Big data meets quantum chemistry approximations: the Δ-machine learning approach." Journal of Chemical Theory and Computation 11.5 (2015): 2087-2096.

[3]: Buterez, David, et al. "Multi-fidelity machine learning models for improved high-throughput screening predictions." ChemRxiv (2022).