(197av) Predicting Reaction Performance in Amide Bond Formation Using Machine Learning: The Role of High-Quality Data | AIChE

Authors 

Lee, Y. S. - Presenter, Imperial College London
Adjiman, C. S. - Presenter, Imperial College
Galindo, A., Imperial College London

Models that predict the yield of organic reactions as a function of reaction conditions are of key importance because they can significantly accelerate the discovery and optimization of new chemical reactions. The rapid growth of artificial intelligence (AI) has recently led to a variety of novel machine learning (ML) approaches for reaction yield prediction [1,2] and reaction condition optimization [3,4], offering efficient alternatives to time-consuming experimental screening. Although the performance of such ML approaches and their potential role in predictive chemistry have been demonstrated for specific reaction classes [1-4], their predictive accuracy is often limited by the lack of high-quality data. Most publicly available data sources, such as USPTO and Reaxys, typically contain a limited subset of successful experiments over a narrow range of process conditions, and precursor concentrations, reaction times, temperatures and kinetics are reported only partially or inadequately [5]. As a result, prediction models built upon these databases tend to be highly biased toward reactions with high yields.

In this work, we systematically investigate the impact of including unsuccessful reaction outcomes and reaction conditions in the development of ML-based reaction prediction models for amide bond formation. Generic data are extracted from the Reaxys database [6] to establish the benchmark performance of the predictive models. To obtain accurate and unbiased results, a custom-made (CM) dataset for amide bond formation is generated by sampling the generic reaction data such that the distribution of yields mimics that of a published high-throughput experimentation (HTE) dataset [7]. Using the curated data, ML-based models for predicting reaction yield [1] and optimizing reaction conditions [3] are trained. Finally, the performance of these models is compared to evaluate the importance of data quality and to assess the applicability of CM data for building predictive reaction models.
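
The abstract does not specify how the generic data are sampled to reproduce the HTE yield distribution. One plausible approach is histogram-matched stratified sampling, sketched below under stated assumptions: yields are percentages in the range 0 to 100, the function name `sample_to_match_yields` and the choice of ten equal-width bins are hypothetical, and under-populated bins are filled by sampling with replacement. This is an illustration of the idea, not the authors' procedure.

```python
import random


def sample_to_match_yields(source_yields, target_yields, n_samples,
                           n_bins=10, seed=0):
    """Return indices into source_yields whose yield histogram
    approximates the histogram of target_yields (yields in percent)."""
    rng = random.Random(seed)
    width = 100.0 / n_bins

    def bin_of(y):
        # Clamp so that a 100% yield falls into the last bin.
        return min(max(int(y // width), 0), n_bins - 1)

    # Histogram of the target (e.g. HTE) yields.
    target_counts = [0] * n_bins
    for y in target_yields:
        target_counts[bin_of(y)] += 1

    # Group source (e.g. generic database) record indices by yield bin.
    pools = [[] for _ in range(n_bins)]
    for i, y in enumerate(source_yields):
        pools[bin_of(y)].append(i)

    chosen = []
    for b in range(n_bins):
        # Number of records this bin should contribute.
        quota = round(n_samples * target_counts[b] / len(target_yields))
        if quota == 0 or not pools[b]:
            continue
        if len(pools[b]) >= quota:
            chosen.extend(rng.sample(pools[b], quota))     # without replacement
        else:
            chosen.extend(rng.choices(pools[b], k=quota))  # with replacement
    return chosen


# Toy data (assumed, not from the study): the source is skewed toward
# high yields, while the target is balanced between low and high yields.
source = [90.0] * 900 + [10.0] * 100
target = [90.0] * 50 + [10.0] * 50
idx = sample_to_match_yields(source, target, n_samples=100)
low_yield_fraction = sum(1 for i in idx if source[i] < 50) / len(idx)
```

Because quotas are rounded per bin, the number of returned indices can differ slightly from `n_samples` for other inputs, and bins that are empty in the source cannot be filled at all, which is one way that missing low-yield records limit what resampling alone can correct.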

References
[1] Schwaller, P., Vaucher, A.C., Laino, T. and Reymond, J.L., 2021. Prediction of chemical reaction yields using deep learning. Machine Learning: Science and Technology, 2(1), p.015016.
[2] Ahneman, D.T., Estrada, J.G., Lin, S., Dreher, S.D. and Doyle, A.G., 2018. Predicting reaction performance in C–N cross-coupling using machine learning. Science, 360(6385), pp.186-190.
[3] Shields, B.J., Stevens, J., Li, J., Parasram, M., Damani, F., Alvarado, J.I.M., Janey, J.M., Adams, R.P. and Doyle, A.G., 2021. Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590(7844), pp.89-96.
[4] Hickman, R.J., Aldeghi, M., Häse, F. and Aspuru-Guzik, A., 2022. Bayesian optimization with known experimental and design constraints for chemistry applications. Digital Discovery, 1(5), pp.732-744.
[5] Struble, T.J., Alvarez, J.C., Brown, S.P., Chytil, M., Cisar, J., DesJarlais, R.L., Engkvist, O., Frank, S.A., Greve, D.R., Griffin, D.J. and Hou, X., 2020. Current and future roles of artificial intelligence in medicinal chemistry synthesis. Journal of Medicinal Chemistry, 63(16), pp.8667-8682.
[6] Reaxys, Elsevier B.V.
[7] Avila, C., Cassani, C., Kogej, T., Mazuela, J., Sarda, S., Clayton, A.D., Kossenjans, M., Green, C.P. and Bourne, R.A., 2022. Automated stopped-flow library synthesis for rapid optimisation and machine learning directed experimentation. Chemical Science, 13(41), pp.12087-12099.