(197at) Assessing the Impact of Quantum Mechanical Descriptors on D-MPNN Performance for Chemical Property Prediction | AIChE

Authors 

Wu, H., Massachusetts Institute of Technology
Spiekermann, K., Massachusetts Institute of Technology
Green, W., Massachusetts Institute of Technology
Directed message-passing neural networks (D-MPNNs) are widely used for predicting chemical properties such as solvation thermodynamics, regioselectivity, and reaction barriers and energies. However, the vastness of chemical space makes it difficult for such machine learning (ML) models to generalize well beyond the chemistry contained in the training set. One augmentation method that has received growing interest is supplying quantum mechanical (QM) descriptors to D-MPNNs and other ML methods to improve their generalizability and performance. However, QM descriptors require expensive computational chemistry calculations, and it is not well established which descriptors to use for different property tasks or in which data regimes these descriptors are most impactful. In this work, we perform a systematic study of the impact of QM descriptors on D-MPNN performance for predicting chemical properties across fourteen different datasets. The selected QM descriptors include atom-level properties such as partial charges, spin densities, and shielding constants; bond-level properties such as bond length, bond order, and hybridization characteristics; and molecule-level properties such as HOMO-LUMO gap and multipole moment values. The target datasets cover a variety of chemical properties ranging from electronic energy and solubility to toxicity and protein binding affinity, include both classification and regression tasks, and range in size from several hundred to one hundred thousand data points. We explore the effect of the level of theory, the choice of descriptors for specific property prediction tasks, how best to include descriptors in the D-MPNN framework, how down-sampling within large datasets changes the descriptors’ impact, and the differences between using directly computed QM descriptors and descriptors predicted by an additional D-MPNN as part of the workflow.
We find that QM descriptors can have a modest impact on the performance of D-MPNNs when training on small datasets or making predictions on difficult data splits where generalizability is key, but generating additional data is often of greater benefit to model performance. Nonetheless, the improvement observed in small-data regimes can be crucial when integrating ML models into de novo workflows for automatic molecular design and synthesis, where experimental data is often limited. Generalizability is key in such applications because the targets are novel chemical species that differ substantially from previously considered compounds. Additionally, even modest improvements in prediction accuracy are useful because false positives during synthesis are very expensive. In such applications, incorporating QM descriptors in conjunction with D-MPNNs can be a useful strategy for enhancing chemical property prediction and accelerating the autonomous molecular exploration workflow.
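
The abstract does not specify how descriptors are supplied to the D-MPNN. As a minimal illustrative sketch, assuming two common options (appending atom-level descriptors to the initial atom features, and appending molecule-level descriptors to the pooled molecular embedding before the feed-forward layers), the data flow could look like the following. All function names, array shapes, and numerical values here are hypothetical and are not taken from the authors' implementation; the sum over atoms is only a stand-in for the message-passing and readout steps.

```python
import numpy as np


def fit_scaler(train_descriptors):
    """Return per-column mean and std computed from training-set descriptors."""
    mean = train_descriptors.mean(axis=0)
    std = train_descriptors.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return mean, std


def augment_atom_features(atom_features, atom_descriptors, mean, std):
    """Append z-scored atom-level QM descriptors to the base atom features."""
    scaled = (atom_descriptors - mean) / std
    return np.concatenate([atom_features, scaled], axis=1)


def augment_molecule_embedding(embedding, mol_descriptors):
    """Append molecule-level QM descriptors to the pooled molecular embedding."""
    return np.concatenate([embedding, mol_descriptors])


# Toy molecule (hypothetical values): 3 atoms, 4 base features,
# 2 atom-level descriptors (e.g. a partial charge and a shielding constant).
atom_features = np.zeros((3, 4))
atom_descriptors = np.array([[-0.4, 30.0], [0.2, 31.0], [0.2, 32.0]])

# In practice the scaler would be fit on all training-set atoms;
# here it is fit on this single molecule for brevity.
mean, std = fit_scaler(atom_descriptors)
augmented = augment_atom_features(atom_features, atom_descriptors, mean, std)

# Stand-in for message passing plus readout: sum the atom vectors.
embedding = augmented.sum(axis=0)

# Two hypothetical molecule-level descriptors (e.g. HOMO-LUMO gap, dipole moment).
mol_input = augment_molecule_embedding(embedding, np.array([0.25, 1.8]))
```

In this sketch the atom-level descriptors influence every message-passing step, whereas the molecule-level descriptors only enter after pooling; which placement helps more for a given task is one of the questions the study examines.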