(182b) Chemical Representations for Improving Retrosynthesis Prediction: Smiles-Grammar and Information Theory | AIChE

Machine learning (ML) methods are beginning to show significant success in many areas of computational chemistry, such as materials discovery, chemical space navigation, reaction modeling, and retrosynthetic analysis. Retrosynthetic analysis has received particular attention, with a wide range of ML methods showing promise. However, state-of-the-art models are primarily data-driven and represent molecules with the simplified molecular-input line-entry system (SMILES) in complex neural-network-based architectures. The limitations of SMILES representations are apparent: they do not incorporate chemistry knowledge, they treat molecules purely as collections of characters, and they ignore the structural relationships between those characters. Nevertheless, much of the work in this area continues to employ such representations.

Here, we propose a grammar-based representation to address some of these limitations. This representation is derived from the underlying SMILES grammar for molecules using a context-free grammar (CFG) framework. We use an information-theoretic analysis to demonstrate the superiority of such representations, quantifying their information capacity using Shannon entropy. We also note the higher redundancy inherent to the grammar representations, which is reflected in their lower conditional information content, as shown in Figure 1. We demonstrate our model's performance on a standard retrosynthesis prediction dataset. This is a filtered dataset derived from the US Patent and Trademark Office (USPTO) database [1], further classified into ten different reaction classes [2]. The dataset contains only the reactants and products, with the reagent information removed and the SMILES strings canonicalized. The advantage of incorporating additional structural information translates into higher prediction accuracy compared to a similar work that uses only character-based SMILES representations [5]. Because it uses grammar representations, the model is also forced to learn the underlying grammar [3,4], and hence the fraction of syntactically invalid predictions is significantly smaller, as shown in Figure 2. We remark that such improvements in accuracy could be observed not just in retrosynthesis prediction tasks but also in other problems that currently use character-based representations of molecules.
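To make the two ideas above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a deliberately tiny toy CFG (three atom types and parenthesized branches only), which is far smaller than the real SMILES grammar, and it uses simple plug-in estimates of Shannon entropy and first-order conditional entropy. The rule names, the `parse` helper, and the example molecules are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy grammar (an assumption for illustration, NOT the full SMILES grammar):
#   S -> A T
#   T -> S | ( S ) T | eps
#   A -> C | N | O
ATOMS = {"C", "N", "O"}

def parse(smiles):
    """Return the leftmost-derivation rule sequence of `smiles` under the toy grammar."""
    rules = []
    pos = 0

    def parse_S():
        nonlocal pos
        rules.append("S->A T")
        if pos >= len(smiles) or smiles[pos] not in ATOMS:
            raise ValueError(f"expected atom at position {pos}")
        rules.append(f"A->{smiles[pos]}")
        pos += 1
        parse_T()

    def parse_T():
        nonlocal pos
        if pos < len(smiles) and smiles[pos] in ATOMS:
            rules.append("T->S")
            parse_S()
        elif pos < len(smiles) and smiles[pos] == "(":
            rules.append("T->( S ) T")
            pos += 1
            parse_S()
            if pos >= len(smiles) or smiles[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            parse_T()
        else:
            rules.append("T->eps")

    parse_S()
    if pos != len(smiles):
        raise ValueError(f"unexpected character at position {pos}")
    return rules

def shannon_entropy(tokens):
    """Plug-in estimate of H(X) in bits from token frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(sequences):
    """Plug-in estimate of H(next token | previous token) in bits.

    Pairs are counted within each sequence only, never across sequence boundaries.
    """
    pairs, prev = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
            prev[a] += 1
    total = sum(pairs.values())
    return -sum((c / total) * math.log2(c / prev[a]) for (a, _), c in pairs.items())

# Hypothetical example molecules that fit the toy grammar.
molecules = ["CCO", "CC(O)N", "CC(C)C"]
char_seqs = [list(s) for s in molecules]    # character-based representation
rule_seqs = [parse(s) for s in molecules]   # grammar-based representation

for name, seqs in [("characters", char_seqs), ("grammar rules", rule_seqs)]:
    flat = [tok for seq in seqs for tok in seq]
    print(name, round(shannon_entropy(flat), 3), round(conditional_entropy(seqs), 3))
```

Because every string is built by expanding production rules, a sequence of rules can only describe a string that the grammar accepts, and the parser rejects inputs such as `"C("`. This is the sense in which a grammar representation constrains syntactic validity. The numbers printed for this three-molecule toy set say nothing about the results reported in the abstract.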

References:

1. Daniel Mark Lowe. Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge, 2012.

2. Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016.

3. Vipul Mann and Venkat Venkatasubramanian. Predicting chemical reaction outcomes: A grammar ontology-based transformer framework. AIChE Journal, 67(3):e17190, 2021.

4. Vipul Mann and Venkat Venkatasubramanian. A formal grammar-based machine learning approach for predicting reaction outcomes. In 2020 Virtual AIChE Annual Meeting. AIChE, 2020.

5. Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, and Vijay Pande. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10):1103–1113, 2017.