(368bp) Accurate Thermochemistry & Kinetics of Ionic Solutes Via Computational Chemistry | AIChE

Authors 

Zheng, J. - Presenter, Massachusetts Institute of Technology
Research Interests

My research experience and interests are in developing fast, accurate models for the solvation of ions. Charged solutes play a significant role in drug discovery and development. Their solvation properties are key metrics in ADMET prediction because they influence the solubility, partitioning, and ionizability of drug molecules. Ionic reactions are also among the most common in chemical manufacturing. Despite this importance, datasets are scarce and predictive models are not accurate.

I leverage machine learning, quantum-chemical calculations, and cheminformatics to develop high-quality datasets, which are then used to generate chemical insights and predictive models. My experience in data science is “end-to-end”: I have expertise in data curation and cheminformatics, software development, and model creation. In my poster, I will highlight major themes in my research under Prof. William H. Green at MIT:

(1) Solvation free energies

Despite the fundamental importance of solvation free energies of charged solutes in chemical modeling, predictive models are poor. One main reason is that their training datasets are small, with the largest containing just a few hundred data points in total (of which 112 are in water). In one project, I compiled a dataset of 273 experiment-derived solvation energies of ions in water. From the increased availability of data, structure-activity patterns became apparent, which I leveraged to develop corrections that reduced model error by approximately 66% (from 4.9 to 1.7 kcal/mol).

I am expanding on this work by including quantum-chemical calculations for other solvents, and leveraging the synthetic data to develop a predictive machine learning model. Other ongoing work includes developing large datasets and predictive models for radical and zwitterionic solutes.

(2) pKa

Working with the International Union of Pure and Applied Chemistry (IUPAC), I have curated a large aqueous pKa dataset that is the first to follow FAIR data principles, a dataset that is now incorporated into PubChem. I am also a contributor to an official IUPAC project related to compiling, curating, and correcting non-aqueous pKa data.

A separate project I worked on demonstrates that quantum chemistry (specifically COSMO-RS calculations), combined with aqueous pKa data, can be used to accurately compute pKa values in other solvents.
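The general idea of transferring an aqueous pKa to another solvent can be illustrated with a standard thermodynamic cycle, in which the pKa shift equals the change in dissociation free energy divided by RT ln 10. The following is a minimal sketch under that assumption, not the workflow used in the project: the function name `pka_in_solvent` and its parameters are hypothetical, and the water-to-solvent transfer free energies would in practice come from a solvation model such as COSMO-RS.

```python
import math

# Gas constant in kcal/(mol K)
R_KCAL = 1.987204e-3


def pka_in_solvent(pka_water, dg_tr_acid, dg_tr_base, dg_tr_proton,
                   temperature=298.15):
    """Estimate pKa in a non-aqueous solvent from the aqueous pKa.

    Each dg_tr_* argument is a transfer free energy in kcal/mol, defined as
    G_solv(species, solvent) - G_solv(species, water). All inputs are
    illustrative assumptions, not values from the abstract.
    """
    # Change in dissociation free energy for HA -> A- + H+ on moving
    # from water to the target solvent.
    ddg_diss = dg_tr_base + dg_tr_proton - dg_tr_acid
    # One pKa unit corresponds to RT ln(10), about 1.36 kcal/mol at 298 K.
    return pka_water + ddg_diss / (R_KCAL * temperature * math.log(10))
```

In this picture, a solvent that stabilizes the conjugate base or the proton less well than water (positive transfer free energies) raises the pKa, while one that stabilizes the neutral acid less well lowers it.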

Inconsistencies in cheminformatics have led to widespread systematic errors in data, which in turn lead to mistakes in predictive models, especially for amino acids and drug molecules. One theme of my work is clarifying these mistakes and encouraging extra effort in cleaning and curating data for machine learning models.

Ongoing research leverages the compiled data to develop predictive models for pKa in water and other solvents.

(3) Rate coefficients for ionic reactions

The effect of solvent on the rates of reactions involving charged molecules is not well understood. In this work, we compiled a set of experimental rate coefficients for 50 reactions, which we then benchmarked using quantum chemistry and transition-state theory. The mean absolute error of our calculations is less than 1 log unit (with higher errors for some classes of reactions than for others), suggesting that solvation models such as COSMO-RS can adequately describe solvent effects. We further showed that the accuracy of these computations is not sensitive to the underlying geometry optimization method, and that even GFN2-xTB geometries led to average errors within one order of magnitude.
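To make the error metric concrete, the sketch below converts an activation free energy into a rate coefficient with the Eyring equation from conventional transition-state theory and scores computed against experimental values in log10 units. It is an illustration only, not the benchmark code: the function names are hypothetical, the transmission coefficient is assumed to be 1, and the activation free energy is assumed to already include the solvation contribution.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
H = 6.62607015e-34       # Planck constant, J s
R_KCAL = 1.987204e-3     # gas constant, kcal/(mol K)


def eyring_rate(dg_act_kcal_mol, temperature=298.15):
    """Eyring rate coefficient for a given activation free energy.

    Units are s^-1 for a unimolecular step; for a bimolecular step the
    result is relative to the standard-state concentration.
    """
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-dg_act_kcal_mol / (R_KCAL * temperature))


def mae_log10(k_calc, k_exp):
    """Mean absolute error between two sets of rate coefficients, in log10 units."""
    errors = [abs(math.log10(c) - math.log10(e)) for c, e in zip(k_calc, k_exp)]
    return sum(errors) / len(errors)
```

Because one log unit corresponds to RT ln 10, an error of 1 log unit at room temperature is equivalent to an error of roughly 1.4 kcal/mol in the activation free energy.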

Current work uses this method to develop a synthetic dataset, which will be used to train a machine learning model that predicts solvent effects.

(4) Software Development

I am a contributor to several popular open-source chemistry packages: Chemprop, Reaction Mechanism Generator, Reaction Data and Molecular Conformers (RDMC), and py2opsin.

Overall mission

I am interested in a position that leverages my broad experience across computational chemistry and machine learning, including data curation and cheminformatics, software development, chemical insight, and model creation, to improve people's lives. I aim to develop models that are ultimately used to do good, whether by developing new materials or by discovering new insights.