(182b) Chemical Representations for Improving Retrosynthesis Prediction: Smiles-Grammar and Information Theory

Conference

AIChE Annual Meeting

Year

2021

Proceeding

2021 Annual Meeting

Group

Computing and Systems Technology Division

Session

CAST Director's Student Presentation Award Finalists (Invited Talks)

Time

Monday, November 8, 2021 - 3:45pm to 4:00pm

Authors

Mann, V. - Presenter

Venkatasubramanian, V., Columbia University

Machine learning (ML) methods are beginning to show significant success in many areas of computational chemistry, such as materials discovery, chemical space navigation, reaction modeling, and retrosynthetic analysis. Retrosynthetic analysis has received particular attention, with a wide range of ML methods showing significant promise. However, the state-of-the-art models are primarily data-driven and use the simplified molecular-input line-entry system (SMILES) to represent molecules in complex neural-network-based architectures. Even though the limitations of SMILES representations are apparent (no incorporation of chemistry knowledge, treating molecules purely as a collection of characters, and ignoring the structural relationships between them), much of the work in this area continues to employ such representations.

Here, we propose a grammar-based representation to address some of these limitations. This representation is derived from the underlying SMILES grammar for molecules using a context-free grammar (CFG) framework. We use an information-theoretic analysis to demonstrate the superiority of such representations, quantifying their information capacity using Shannon entropy. We also note the higher redundancy inherent to the grammar representations, reflected in their lower conditional information content, as shown in Figure 1. We demonstrate our model's performance on a standard retrosynthesis prediction dataset. This is a filtered dataset derived from the US Patent and Trademark Office's (USPTO) database [1], further classified into ten different reaction classes [2]. The dataset contains only the reactants and products, with the reagent information removed and the SMILES strings canonicalized. The advantage of incorporating additional structural information translates into higher prediction accuracy compared to similar work that uses only character-based SMILES representations. As a consequence of using the grammar representations, the model is also forced to learn the underlying grammar [3,4], and hence the fraction of syntactically invalid predictions is significantly smaller, as shown in Figure 2. We remark that such improvements in accuracy could be observed not just in retrosynthesis prediction tasks but also in other problems that currently rely on character-based representations of molecules.
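The two information-theoretic quantities mentioned above can be illustrated with a minimal Python sketch. It is not the authors' analysis: the function names, the first-order (previous-token) conditioning, and the four toy SMILES strings are all assumptions made for illustration. It estimates the Shannon entropy H(X) = -Σ p(x) log2 p(x) of a token distribution and the conditional entropy H(X_t | X_{t-1}) from adjacent-token pairs, here for character-level tokens.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Entropy (bits) of the empirical distribution over `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(sequences):
    """Estimate H(X_t | X_{t-1}) in bits from adjacent-token pairs.

    Assumption: a first-order context is used purely for illustration;
    the abstract does not state which conditioning the authors used.
    """
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for (prev, _cur), c in pair_counts.items():
        p_pair = c / total            # joint probability p(prev, cur)
        p_cond = c / prev_counts[prev]  # conditional probability p(cur | prev)
        h -= p_pair * math.log2(p_cond)
    return h


# Toy, hand-picked SMILES strings (hypothetical; not the USPTO dataset).
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN"]
char_sequences = [list(s) for s in smiles]
all_chars = [ch for seq in char_sequences for ch in seq]

print("marginal entropy (bits):", shannon_entropy(all_chars))
print("conditional entropy (bits):", conditional_entropy(char_sequences))
```

Both functions accept any hashable tokens, so the same estimate could in principle be run on sequences of grammar production rules instead of characters to compare the two representations.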

References:

1. Daniel Mark Lowe. Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge, 2012.

2. Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What's what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016.

3. Vipul Mann and Venkat Venkatasubramanian. Predicting chemical reaction outcomes: A grammar ontology-based transformer framework. AIChE Journal, 67(3):e17190, 2021.

4. Vipul Mann and Venkat Venkatasubramanian. A formal grammar-based machine learning approach for predicting reaction outcomes. In 2020 Virtual AIChE Annual Meeting. AIChE, 2020.

5. Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, and Vijay Pande. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10):1103–1113, 2017.

Topics

Chemical Reaction Engineering

Chemicals & Materials

Computing and Systems Engineering
