(2ge) Multi-Fidelity Computer-Aided Molecular Design | AIChE

Authors 

Greenman, K. P. - Presenter, Massachusetts Institute of Technology

Research Interests

My interests and expertise are in combining machine learning with physics-based calculations and high-throughput experiments for molecular design and discovery.

Prior Work

In my PhD work with Profs. Rafael Gómez-Bombarelli and William Green at MIT, I have worked on several projects, including:

(1) Developed multi-fidelity deep learning models to achieve state-of-the-art predictions of absorption wavelength of dye molecules in a variety of solvents, based on a combination of data from time-dependent density functional theory and experiments.

(2) Integrated the models from (1) into a closed-loop active learning framework with high-throughput synthesis and characterization experiments, in collaboration with Profs. Klavs Jensen, Regina Barzilay, and Tommi Jaakkola. The framework used exploration to improve models in high-uncertainty regions of chemical space and exploitation to discover novel dye molecules with desired properties.

(3) Developed a tool for extracting domain-specific molecular structures from US patents using keyword queries. Used these structures and high-throughput density functional theory calculations to train generative models to propose novel organic photodiodes.

(4) Served as a lead maintainer of the open-source Chemprop package for molecular property prediction, with over 1000 stars on GitHub.

I also completed a micro-internship at Microsoft Research New England, where I created a benchmark of uncertainty quantification strategies for protein engineering. I applied these uncertainty methods in active learning and Bayesian optimization to identify the effect of domain shift on these tasks.

Additionally, I completed an internship at Eli Lilly, where I studied multi-fidelity modeling of bioactivity, as well as impurity prediction for deep learning forward synthesis models.

Future Vision

Building on my previous work, my research group will initially address the following aims:

(1) Create molecular benchmark datasets for the study of multi-fidelity, multi-objective, batch active learning.

Molecular design is typically a multi-objective task, and training data for the machine learning models used for this task are often available at more than one level of fidelity with a cost-accuracy tradeoff (e.g., theoretical calculations and experiments). Furthermore, it is often advantageous to choose experiments intelligently using active learning, which reduces costs relative to random search. When active learning is used to propose the new measurements that will be most informative to a model, these should ideally be proposed in batches to be compatible with high-throughput experimental setups. A large body of work exists on single-fidelity modeling for single objectives, and single-sample active learning is also routinely studied in chemical design. Recent work has begun to extend design to the multi-fidelity, multi-objective, batch case, but many challenges remain. My group will develop benchmark datasets and evaluate state-of-the-art approaches on these benchmarks to enable more efficient design and discovery of new molecules.
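As a rough illustration of what batch acquisition means in practice, the sketch below picks the candidates on which an ensemble of models disagrees most. It is only one simple strategy among many and is not the method used in the work described here; the function name `select_batch`, the use of ensemble standard deviation as the uncertainty estimate, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def select_batch(ensemble_preds, batch_size):
    """Pick the candidates with the largest ensemble disagreement.

    ensemble_preds: array-like of shape (n_models, n_candidates), one row of
    property predictions per ensemble member (a hypothetical input format).
    Returns (indices of the chosen batch, per-candidate uncertainty).
    """
    preds = np.asarray(ensemble_preds, dtype=float)
    # Standard deviation across ensemble members as a simple uncertainty proxy.
    uncertainty = preds.std(axis=0)
    # Sort candidates from most to least uncertain and keep the top batch_size.
    order = np.argsort(-uncertainty, kind="stable")
    return order[:batch_size].tolist(), uncertainty


# Toy example: two ensemble members, four candidate molecules.
toy_preds = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.5, 5.0, 4.1],
]
batch, uncertainty = select_batch(toy_preds, batch_size=2)
```

A plain top-k rule like this tends to select near-duplicate candidates, which is one reason batch-aware acquisition that also accounts for diversity, measurement cost, and multiple fidelities remains an open benchmarking question.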

(2) Design new molecular optical probes for biomedical imaging applications.

Molecular optical probes can be used for biomedical imaging and guided surgery. Applications typically require the molecules to satisfy several design constraints, including peak absorption and emission wavelength, Stokes shift, quantum yield, solubility, and toxicity. My group will use the techniques studied in aim (1) to design novel optical probes with improved properties. In particular, we will focus on molecules that absorb or emit near-infrared light because biological tissue is more transparent in this range of the spectrum.
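To make the idea of screening against several design constraints concrete, the following minimal sketch computes a Stokes shift in wavenumbers from peak absorption and emission wavelengths and applies a simple pass/fail filter. The threshold values, the dictionary keys, and the function names are hypothetical placeholders chosen for illustration, not criteria taken from this work; solubility and toxicity are left out for brevity.

```python
def stokes_shift_cm1(abs_nm, em_nm):
    """Stokes shift in cm^-1 from peak absorption/emission wavelengths in nm."""
    # Convert each wavelength (nm) to a wavenumber (cm^-1) and take the difference.
    return 1e7 / abs_nm - 1e7 / em_nm


def passes_constraints(probe, nir_min_nm=700.0, min_shift_cm1=300.0, min_qy=0.1):
    """Check a candidate probe against illustrative (assumed) thresholds."""
    shift = stokes_shift_cm1(probe["abs_nm"], probe["em_nm"])
    return (
        probe["em_nm"] >= nir_min_nm  # emits at or beyond the assumed NIR cutoff
        and shift >= min_shift_cm1  # enough separation between the two peaks
        and probe["quantum_yield"] >= min_qy  # bright enough to image
    )


# Hypothetical candidates with made-up property values.
candidates = {
    "A": {"abs_nm": 750.0, "em_nm": 780.0, "quantum_yield": 0.2},
    "B": {"abs_nm": 500.0, "em_nm": 520.0, "quantum_yield": 0.5},
    "C": {"abs_nm": 800.0, "em_nm": 810.0, "quantum_yield": 0.3},
}
shortlist = [name for name, probe in candidates.items() if passes_constraints(probe)]
```

In a real design campaign these hard cutoffs would more likely be replaced by a multi-objective treatment that trades the properties off against one another.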

(3) Develop tools for robust natural language explanations of uncertainty quantification, acquisition, and design choices.

Explainability tools for machine learning allow for more interpretable predictions/decisions and can build trust among chemical domain experts. We will integrate state-of-the-art attribution methods with large language models and study how to produce robust natural-language explanations of machine learning models based on user questions. Our work will focus on explaining why a model is uncertain for a particular prediction or in a general region of chemical space, why a particular batch of molecules was acquired during active learning, and how a molecule can be modified to fit design criteria.

Teaching Interests

I would be glad to teach any core undergraduate or graduate chemical engineering course, and I would be particularly excited about teaching numerical methods, fluid mechanics, or kinetics and reactor design. In addition, my previous research and teaching experience has prepared me to teach electives in molecular modeling / computational chemistry or machine learning / data science.

I have participated in several pedagogical training opportunities and have earned certificates in subject design, lesson planning, and inclusive teaching from MIT’s Teaching and Learning Lab. I have also put this training into practice as an instructional aide for undergraduate fluid mechanics, as a teaching assistant for graduate machine learning for molecular design, and as a mentor of five undergraduate students on research projects. I won first place in the MIT chemical engineering department’s 2021 teach-off competition for a lesson I prepared and taught.

As an undergraduate student, I led the creation of the computational curriculum for a new course to reduce barriers to undergraduate research (in collaboration with faculty, graduate students, and other undergraduates). I published an article on this course in the Chemical Engineering Education journal. In the future, I plan to explore opportunities to incorporate artificial intelligence tools (including large language models such as ChatGPT) into the chemical engineering curriculum and to conduct pedagogical research on this topic.