(528c) Machine Learning for Thermodynamic Properties of Materials: Towards the Prediction of Solubility | AIChE

Authors 

Vermeire, F. - Presenter, Massachusetts Institute of Technology
Green, W., Massachusetts Institute of Technology
Introduction

Solubility is one of the most important properties in the development and design of drugs. It plays a major role in pharmacokinetics and pharmacodynamics, where it relates to dosing, toxicity and the effect of drugs on their targets. In process design, knowledge of the solubility of drugs and intermediate products is crucial for designing separation techniques and for transitioning from batch to flow processes. Computer-aided screening of the solubility of active pharmaceutical ingredients (APIs) can dramatically shorten the drug discovery cycle and reduce the costs of developing and manufacturing new drugs.

State-of-the-art solubility models focus on predicting octanol-water partitioning coefficients (logP) and solid aqueous solubility (logS) because of their role in evaluating the drug-likeness of APIs in an initial screening process. Although models for octanol-water partitioning coefficients have improved considerably in recent years, predicting aqueous solubility for a broad range of drug-like compounds remains challenging. In view of downstream manufacturing and process intensification, the interest in solubility prediction is not limited to octanol or water as a solvent. Computer-aided screening of potential solvents can contribute significantly to reducing waste streams and can ease upscaling towards continuous processing. However, models for solubility prediction in other organic solvents are limited by the scarcity of experimental data.

This work aims to use machine learning techniques and physical relationships to predict the solubility of gases, liquids and solids in a variety of solvents. Theoretical quantum calculations are used to compensate for scarce and biased experimental databases. Furthermore, the physical relation between gas-liquid solvation energies, liquid-liquid partitioning coefficients and solid solubility is used to aid solubility predictions for drug-like molecules in a broad range of solvents.

Methodology

Graph convolutional neural networks are used to encode the molecular structure of both the solute and solvent. A molecular embedding is learned separately for the solute and solvent with the directed message passing neural network, as implemented in Chemprop. The solute and solvent embeddings are concatenated and fed through another dense layer for the property prediction.
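The two-encoder layout described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Chemprop implementation: the encoder below passes messages between atoms rather than along directed bonds, and the two-element atom features, the random untrained weights and all function names (`dense`, `encode`, `predict`) are hypothetical. No training is shown.

```python
import random

def dense(weights, bias, x):
    """One dense layer with ReLU activation: relu(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def random_matrix(rng, n_out, n_in):
    """Untrained placeholder weights (hypothetical, for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def encode(atom_features, bonds, w_msg, steps=3):
    """Simplified atom-level message passing followed by sum pooling.

    Assumption: this stands in for the directed (bond-level) scheme used by
    Chemprop; it only illustrates how a molecular graph becomes one vector.
    """
    n_atoms, dim = len(atom_features), len(atom_features[0])
    neighbours = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    hidden = [list(f) for f in atom_features]
    zero_bias = [0.0] * dim
    for _ in range(steps):
        updated = []
        for i in range(n_atoms):
            # Message = sum of the current neighbour states.
            message = [sum(hidden[j][k] for j in neighbours[i]) for k in range(dim)]
            # New state = initial features + transformed message.
            delta = dense(w_msg, zero_bias, message)
            updated.append([f + d for f, d in zip(atom_features[i], delta)])
        hidden = updated
    # Summing over atoms gives a fixed-length molecular embedding.
    return [sum(h[k] for h in hidden) for k in range(dim)]

def predict(solute, solvent, params):
    """Concatenate solute and solvent embeddings and apply a dense readout."""
    z = encode(*solute, params["w_solute"]) + encode(*solvent, params["w_solvent"])
    hidden = dense(params["w_hidden"], params["b_hidden"], z)
    return sum(w * h for w, h in zip(params["w_out"], hidden)) + params["b_out"]

# Toy heavy-atom graphs with features [is_carbon, is_oxygen].
ethanol = ([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [(0, 1), (1, 2)])
water = ([[0.0, 1.0]], [])

rng = random.Random(0)
params = {
    "w_solute": random_matrix(rng, 2, 2),
    "w_solvent": random_matrix(rng, 2, 2),
    "w_hidden": random_matrix(rng, 4, 4),
    "b_hidden": [0.0] * 4,
    "w_out": [rng.uniform(-0.5, 0.5) for _ in range(4)],
    "b_out": 0.0,
}
prediction = predict(ethanol, water, params)  # untrained, so not meaningful
```

Keeping separate weights for the solute and solvent encoders mirrors the statement that the two embeddings are learned separately; only the readout layer sees both molecules at once.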

Quantum chemical calculations are performed to determine the gas-liquid solvation free energies for various solute and solvent combinations. These calculations are used to pretrain the message passing neural network, so that physical interactions between solvents and solutes can be learned. In contrast to the experimental gas-liquid solvation free energies, this dataset is not biased towards certain solvents (e.g. water). Using theoretical data to pretrain the model also reduces the effect of aleatoric errors, caused by data noise, on the final model predictions. The dataset can be extended towards classes of solutes and solvents for which little or no experimental data are available, improving the generalizability of the model.

For the prediction of liquid-liquid partitioning coefficients and solid solubilities, the physical relationship with the well-established gas-liquid solvation free energies is enforced. Liquid-liquid partitioning coefficients are directly related to the ratio of the gas-liquid partitioning coefficients of the same solute in both solvents. Some error is introduced by the mutual solubility of the solvents, which will be accounted for in the loss function. The solid solubility is related to the gas-liquid partitioning coefficients and the fusion free energy of the crystal structure. To account for the fusion free energy, an additional model is trained on an experimental dataset of melting temperatures and fusion enthalpies.
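The link between partitioning coefficients and solvation free energies can be made concrete with a short calculation. The sketch below assumes an idealized case: infinite dilution, fully immiscible solvents (so the mutual-solubility error mentioned above is ignored), and both solvation free energies expressed with the same concentration-based standard state. Under those assumptions each gas-liquid partitioning coefficient is K = exp(-ΔGsolv/RT), and their ratio gives the liquid-liquid coefficient. The function name and the numerical inputs are hypothetical.

```python
import math

R_KCAL = 8.314462618 / 4184.0  # molar gas constant in kcal/(mol K)

def log10_partition(dg_solv_a, dg_solv_b, temperature=298.15):
    """log10 of P = c_a / c_b for one solute distributed between solvents a and b.

    Since K_a / K_b = exp((dG_b - dG_a) / RT), it follows that
    log10 P = (dG_b - dG_a) / (RT ln 10). Energies are in kcal/mol.
    """
    return (dg_solv_b - dg_solv_a) / (R_KCAL * temperature * math.log(10.0))

# Hypothetical solvation free energies (kcal/mol), not values from this work.
dg_octanol, dg_water = -5.0, -3.0
log_p = log10_partition(dg_octanol, dg_water)
print(f"log10 P(octanol/water) = {log_p:.2f}")
```

At 298.15 K one log unit corresponds to roughly 1.36 kcal/mol, which helps put errors in predicted solvation free energies on the same scale as errors in logP.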

Results

A preliminary theoretical dataset is constructed with COSMOtherm. It consists of more than 20,000 data points and includes 78 solvents and 244 solutes. Solvation free energies of the theoretical dataset are predicted by the model with an RMSE of 0.08 kcal/mol and an MAE of 0.04 kcal/mol. Analysis of the solute and solvent molecular embeddings demonstrates the learned physical interactions. A principal component analysis of the solvent molecular embeddings clusters solvents according to their physical properties relevant to solvation. Analysis of the atom messages for some solute molecules highlights atoms with an important hydrogen bonding/donating character.
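For reference, the two error metrics quoted throughout can be computed as follows. This is a generic sketch with made-up numbers, not data from this work. It also shows why the RMSE is never smaller than the MAE: squaring weights large deviations more heavily, so a single outlier widens the gap between the two.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error between two equally long sequences."""
    errors = [p - r for p, r in zip(predicted, reference)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(predicted, reference):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

# Hypothetical solvation free energies in kcal/mol; the last prediction is an outlier.
reference = [-3.10, -5.42, -0.75, -7.90]
predicted = [-3.05, -5.40, -0.80, -7.50]

print(f"RMSE = {rmse(predicted, reference):.3f} kcal/mol")
print(f"MAE  = {mae(predicted, reference):.3f} kcal/mol")
```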

The model trained on the theoretical dataset is further refined using experimental data for solvation free energies. The experimental dataset is gathered from different sources, including the Minnesota Solvation Database, the CompSol database and several publications on the regression of Abraham solute and solvent parameters. The total experimental dataset has more than 8900 entries for solute-solvent combinations with over 300 different solvents. The machine learning model uses molecular embeddings for the solute and solvent molecules that are learned from the theoretical dataset. The dense layer for property prediction is initialized with the weights learned from the theoretical dataset and optimized with the experimental solvation free energies. Solvation free energies in a randomly selected test set can be predicted with an RMSE of 0.60 (±0.04) kcal/mol and an MAE of 0.27 (±0.02) kcal/mol. These results are similar to those of a model without pretraining. However, if no pretraining is done on a theoretical dataset, analysis of the molecular embeddings does not directly demonstrate learning of any physical interactions. Moreover, the predictions on “unseen” solvents, i.e. those not included in the training set, improve significantly if pretraining is applied.
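The distinction between a randomly selected test set and a test set of “unseen” solvents comes down to how the data are split. The sketch below shows one way to hold out whole solvents; it is an assumption about how such a split could be built, not the procedure used in this work. The function name, the record layout (solute SMILES, solvent SMILES, solvation free energy) and the values are hypothetical.

```python
import random

def split_by_solvent(records, test_fraction=0.2, seed=0):
    """Hold out every record of a randomly chosen subset of solvents.

    records: list of (solute, solvent, dg_solv) tuples.
    Returns (train, test) such that no test solvent occurs in train.
    """
    solvents = sorted({solvent for _, solvent, _ in records})
    rng = random.Random(seed)
    rng.shuffle(solvents)
    n_test = max(1, round(test_fraction * len(solvents)))
    held_out = set(solvents[:n_test])
    train = [r for r in records if r[1] not in held_out]
    test = [r for r in records if r[1] in held_out]
    return train, test

# Hypothetical records: (solute SMILES, solvent SMILES, dG_solv in kcal/mol).
records = [
    ("CCO", "O", -5.0),
    ("c1ccccc1", "O", -0.9),
    ("CCO", "CCCCCCCCO", -4.4),
    ("c1ccccc1", "CCCCCCCCO", -3.7),
    ("CCO", "ClC(Cl)Cl", -3.9),
    ("c1ccccc1", "ClC(Cl)Cl", -4.6),
]
train, test = split_by_solvent(records)
```

With a random split, every solvent in the test set typically also appears in training, so the reported error mostly measures interpolation across solutes. A solvent-level split instead probes whether the solvent embedding generalizes.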

For the prediction of partitioning coefficients, a dataset available in OChem is used. This dataset includes 3600 entries for partitioning coefficients between water and various solvents, including chloroform, cyclohexane, toluene and hexadecane. The model trained on theoretical gas-liquid solvation energies is used only for prediction, without training on the experimental logP data, and reproduces those data with an RMSE of 1.18 and an MAE of 0.66. Further efforts to improve predictions of partitioning coefficients will include (1) collecting a broader dataset and (2) refining a pretrained model on experimental data while enforcing the physical relation with gas-liquid solvation energies.

Solid solubility can be related to the gas-liquid solvation energy of the solute in the respective solvent and the solute fusion free energy. The fusion free energy of a crystal structure can be approximated from a known melting temperature and fusion enthalpy. Experimental data on the latter two properties are gathered from different sources, including OChem and NIST. The resulting database is used to train a machine learning model in which a directed message passing neural network, as implemented in Chemprop, constructs the molecular embeddings. Fusion enthalpies are predicted with an RMSE of 1.69 (±0.14) kcal/mol and melting temperatures with an RMSE of 33.7 (±1.6) K. These predictions will be combined with the model for gas-liquid solvation energies to create a model for solid solubilities in a broad range of solvents.
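One common way to approximate the fusion free energy from a melting temperature and a fusion enthalpy is ΔGfus(T) ≈ ΔHfus (1 - T/Tm). The abstract does not state which approximation is used, so the sketch below is an assumption: it neglects the heat capacity difference between solid and liquid and treats ΔHfus as temperature independent. The ideal mole-fraction solubility, exp(-ΔGfus/RT), is included only to show how errors in Tm carry over to a solubility estimate. It contains no solvent-specific term, which in the approach described here would come from the solvation free energy model. Function names and input values are hypothetical.

```python
import math

R_KCAL = 8.314462618 / 4184.0  # molar gas constant in kcal/(mol K)

def fusion_free_energy(dh_fus, t_melt, temperature=298.15):
    """Approximate dG_fus(T) = dH_fus * (1 - T / Tm), in kcal/mol.

    Assumes zero heat capacity difference between solid and liquid.
    """
    return dh_fus * (1.0 - temperature / t_melt)

def ideal_log10_solubility(dh_fus, t_melt, temperature=298.15):
    """log10 of the ideal mole-fraction solubility, x = exp(-dG_fus / RT)."""
    dg_fus = fusion_free_energy(dh_fus, t_melt, temperature)
    return -dg_fus / (R_KCAL * temperature * math.log(10.0))

# Hypothetical solute: dH_fus = 6.0 kcal/mol, Tm = 400 K.
base = ideal_log10_solubility(6.0, 400.0)
# Effect of shifting Tm by the reported melting-temperature RMSE of 33.7 K.
shifted = ideal_log10_solubility(6.0, 400.0 + 33.7)
print(f"log10 x_ideal = {base:.2f}; with Tm + 33.7 K: {shifted:.2f}")
```

For this hypothetical solute, a melting-temperature error of the size reported above changes the ideal solubility by a few tenths of a log unit, which gives a rough sense of how the fusion model's accuracy would feed into solid solubility predictions.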

Conclusions

A machine learning model is developed for the prediction of gas-liquid solvation free energies. The molecular embeddings for solvent and solute are constructed with a directed message passing neural network. Experimental solvation free energies are predicted using the new model, pretrained on theoretical data, with an RMSE of 0.60 kcal/mol. Pretraining on a theoretical dataset helps the model capture the physical interactions between solvent and solute molecules and improves its performance on unseen solvents.

The physical relationship between gas-liquid solvation energies, partitioning coefficients and solid solubility will be used to enable the prediction of those properties in a broad range of solvents. The solute fusion enthalpy and melting temperature are required for the solid solubility predictions. They are predicted by another model with RMSEs of 1.69 kcal/mol and 33.7 K, respectively.

Figure

Figure 1. Parity plots for the predictions of gas-liquid solvation free energies ΔGsolv, fusion enthalpies ΔHfus and melting temperatures Tm.

This paper has an Extended Abstract file available; you must purchase the conference proceedings to access it.
