(118d) Semi-Supervised Learning Detects Safe Islands Where Density Functional Theory Is Applicable for Chemical Discovery | AIChE

Authors 

Nandy, A., Massachusetts Institute of Technology
Liu, F., Stanford University
Density functional theory (DFT) plays an essential role in the discovery of new molecules and materials because of its favorable balance between accuracy and computational cost. Despite this balance, DFT can be inaccurate for strongly correlated systems (e.g., those containing transition metals or stretched bonds) and for chemical reactions that involve bond breaking and formation. To address this challenge, numerous multi-reference (MR) diagnostics have been developed over the years. These diagnostics aim to detect strong correlation and identify when single-reference methods (such as DFT) are insufficient for obtaining molecular properties. Despite the emergence of dozens of diagnostics, no single diagnostic has been universally predictive of MR character. Additionally, many such diagnostics are based on computationally demanding calculations (i.e., wave function theory, or WFT) that may themselves be intractable. Therefore, we aim to generate a low-cost and universal metric that measures whether a system is suitable for DFT evaluation. We generate a dataset that contains 15 MR diagnostics from multiple levels of theory on equilibrium and distorted geometries of organic molecules. We observe poor linear correlation across the 15 MR diagnostics, with the more expensive WFT-based diagnostics outperforming the DFT-based diagnostics in predicting a figure of merit for MR character. We train supervised learning models to predict the more demanding WFT diagnostics using the DFT diagnostics along with the molecular geometries as inputs, eliminating the cost of performing the WFT calculations. We show how a semi-supervised learning classifier unravels the hidden consensus of the 15 MR diagnostics by utilizing the distribution of the diagnostics for all data points. We demonstrate that this classifier is robust to noisy inputs and thus can achieve the same accuracy even when it sees only the ML-predicted WFT diagnostics as the model inputs.
Since our workflow requires only DFT calculations, we expect this data-driven decision engine to be useful in detecting “safe islands” for DFT-based high-throughput computation and in guiding the selection of the optimal electronic structure methods for the evaluation of new chemical systems in chemical discovery.
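
The abstract does not say which semi-supervised algorithm is used, so the following is only a minimal sketch of the general idea, assuming a simple self-training scheme with nearest-neighbor voting. Points on which every diagnostic agrees receive a seed label, and the remaining points are labeled one at a time from their nearest labeled neighbors in diagnostic space. The synthetic data, the cutoffs `low` and `high`, and all function names are hypothetical and are not taken from the work described above.

```python
import math
import random


def make_synthetic_diagnostics(n_points=120, n_diag=3, seed=0):
    """Hypothetical data: a hidden MR character in [0, 1] drives every
    diagnostic, and each diagnostic adds its own Gaussian noise."""
    rng = random.Random(seed)
    rows, hidden = [], []
    for _ in range(n_points):
        t = rng.random()
        hidden.append(t)
        rows.append([t + rng.gauss(0.0, 0.1) for _ in range(n_diag)])
    return rows, hidden


def seed_labels(rows, low=0.3, high=0.7):
    """Label only the points where all diagnostics agree (assumed cutoffs)."""
    labels = []
    for row in rows:
        if all(v < low for v in row):
            labels.append(0)      # consensus: single-reference character
        elif all(v > high for v in row):
            labels.append(1)      # consensus: multi-reference character
        else:
            labels.append(None)   # diagnostics disagree: leave unlabeled
    return labels


def propagate(rows, labels, k=5):
    """Self-training: in each round, label the unlabeled point that lies
    closest to the labeled set, by majority vote of its k nearest
    labeled neighbors."""
    labels = list(labels)
    if all(lab is None for lab in labels):
        raise ValueError("need at least one seed label")
    while any(lab is None for lab in labels):
        labeled = [i for i, lab in enumerate(labels) if lab is not None]
        best = None  # (mean neighbor distance, point index, predicted label)
        for i, lab in enumerate(labels):
            if lab is not None:
                continue
            nearest = sorted(
                (math.dist(rows[i], rows[j]), labels[j]) for j in labeled
            )[:k]
            mean_d = sum(d for d, _ in nearest) / len(nearest)
            votes = sum(lab_j for _, lab_j in nearest)
            pred = 1 if 2 * votes > len(nearest) else 0
            if best is None or mean_d < best[0]:
                best = (mean_d, i, pred)
        labels[best[1]] = best[2]
    return labels


rows, hidden = make_synthetic_diagnostics()
seeds = seed_labels(rows)
final = propagate(rows, seeds)
print("seed labels:", sum(lab is not None for lab in seeds), "of", len(rows))
print("labeled as single-reference:", final.count(0))
print("labeled as multi-reference:", final.count(1))
```

In this toy setting, the points labeled 0 play the role of the “safe islands”: regions of diagnostic space where the labeled consensus indicates single-reference character. A real workflow would replace the synthetic rows with computed or ML-predicted diagnostics.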