(64e) Prediction of Organic Compound Aqueous Solubility Using Interpretable Machine Learning-a Comparison Study of Descriptor-Based and Topological Models | AIChE

Authors 

Alshami, A., University of North Dakota
Reliable, practical determination of a chemical species’ solubility in water still relies almost entirely on empirical observation and exhaustive experimental study. Data-driven prediction of aqueous solubility could provide a rationally designed, efficient, and cost-effective tool for developing next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies that predict the solubility of diverse species using data for over 8,400 compounds. Two molecular representations were used to produce water solubility estimates: molecular descriptors, the most widely used approach in previous studies, and Morgan fingerprints, a topological, circular hash of each molecule's structure. We trained all models on 80% of the full dataset using random forest (RF) regression and tested prediction performance on the remaining 20%, obtaining test-set R² values of 0.88 and 0.82 for the descriptor and circular fingerprint methods, respectively.
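
The 80/20 evaluation protocol and the R² score reported above can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names, the fixed random seed, and the round figure of 8,400 compounds are assumptions made for illustration, and the featurization and random forest regressor are deliberately left out.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train, test) index lists.

    Hypothetical helper: the seed and shuffling scheme are assumptions,
    not details taken from the abstract.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative size only; the abstract states "over 8,400 compounds".
train_idx, test_idx = split_indices(8400)
print(len(train_idx), len(test_idx))
```

In practice a model would be fitted on the rows selected by `train_idx`, and `r_squared` would be evaluated only on predictions for the rows in `test_idx`, so the reported score reflects compounds the model has not seen.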

We also evaluated the models against a blind dataset of 32 low-molecular-weight organic molecules, containing 1 to 12 carbon atoms, with experimentally measured aqueous solubilities. The fingerprint method showed a lower average absolute error (∼0.25 log units), which is comparable to currently available group contribution methods. For context, the average uncertainty in measured aqueous solubility for organic molecules is ∼0.6 log units or higher when values are gathered from various published sources. We also used SHAP analysis to gain insight into how important features affect an ML model's output, and calculated Gibbs energies for these features to investigate their thermodynamic favorability. The fingerprint method can predict aqueous solubility with low error, and its interpretability distinguishes it from other currently available methods.
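
The average absolute error quoted above is the mean of |predicted − measured| over the blind set, with solubility expressed on a log scale. The sketch below shows that calculation; the function name and the two-compound input are hypothetical placeholders, not data from the study, and the 0.6 log-unit constant simply restates the experimental uncertainty mentioned in the text.

```python
def mean_absolute_error(log_s_measured, log_s_predicted):
    """Average absolute error between measured and predicted log-solubility."""
    if len(log_s_measured) != len(log_s_predicted):
        raise ValueError("inputs must have the same length")
    errors = [abs(m - p) for m, p in zip(log_s_measured, log_s_predicted)]
    return sum(errors) / len(errors)


# Approximate uncertainty of literature solubility measurements (log units),
# as stated in the abstract.
EXPERIMENTAL_UNCERTAINTY = 0.6

# Hypothetical placeholder values, NOT results from the study.
measured = [-1.0, -2.0]
predicted = [-1.5, -2.0]

mae = mean_absolute_error(measured, predicted)
within_noise = mae <= EXPERIMENTAL_UNCERTAINTY
print(mae, within_noise)
```

Comparing the error against the measurement uncertainty is a useful sanity check: a model whose error is already below the spread of the experimental data cannot be meaningfully ranked against another such model on that data alone.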