(182b) Chemical Representations for Improving Retrosynthesis Prediction: Smiles-Grammar and Information Theory
AIChE Annual Meeting
2021 Annual Meeting
Computing and Systems Technology Division
CAST Director's Student Presentation Award Finalists (Invited Talks)
Monday, November 8, 2021 - 3:45pm to 4:00pm
Here, we propose a grammar-based representation to address some of these limitations. The representation is derived from the underlying SMILES grammar for molecules using a context-free grammar (CFG) framework. We use an information-theoretic analysis to demonstrate the superiority of such representations, quantifying their information capacity with Shannon entropy. We also note the higher redundancy inherent to the grammar-based representations, reflected in their lower conditional information content, as shown in Figure 1. We demonstrate our model's performance on a standard retrosynthesis prediction dataset: a filtered dataset derived from the US Patent and Trademark Office (USPTO) database [1], further classified into ten reaction classes [2]. This dataset contains only the reactants and products, with the reagent information removed and the SMILES strings canonicalized. The additional structural information translates into higher prediction accuracy than a similar work that uses only character-based SMILES representations. Because the model uses grammar-based representations, it is also forced to learn the underlying grammar [3,4], and hence the fraction of syntactically invalid predictions is significantly smaller, as shown in Figure 2. We remark that such improvements in accuracy could be expected not just in retrosynthesis prediction but also in other problems that currently rely on character-based representations of molecules.
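The two information-theoretic quantities mentioned above, Shannon entropy and conditional information content, can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the first-order (previous-token) conditioning, and the toy SMILES strings are all assumptions made for illustration. For a grammar-based representation, the same functions would presumably be applied to sequences of production-rule identifiers instead of characters.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy H(X), in bits, of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(sequences):
    """H(X_t | X_{t-1}), in bits, estimated from adjacent token pairs.

    Assumption: only the immediately preceding token is used as context.
    """
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    total = sum(pair_counts.values())
    h = 0.0
    for (prev, _cur), c in pair_counts.items():
        p_pair = c / total                   # joint probability p(prev, cur)
        p_cur_given_prev = c / prev_counts[prev]  # conditional p(cur | prev)
        h -= p_pair * math.log2(p_cur_given_prev)
    return h


# Hypothetical toy data: character-level tokenization of a few SMILES strings.
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
char_sequences = [list(s) for s in smiles]
all_chars = [t for seq in char_sequences for t in seq]

print("H(X) over characters:", shannon_entropy(all_chars))
print("H(X_t | X_{t-1}) over characters:", conditional_entropy(char_sequences))
```

A lower conditional entropy means that the next token is more predictable from its context, which is one way to read the "higher redundancy" attributed to the grammar-based representations.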
References:
1. Daniel Mark Lowe. Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge, 2012.
2. Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What's what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016.
3. Vipul Mann and Venkat Venkatasubramanian. Predicting chemical reaction outcomes: A grammar ontology-based transformer framework. AIChE Journal, 67(3):e17190, 2021.
4. Vipul Mann and Venkat Venkatasubramanian. A formal grammar-based machine learning approach for predicting reaction outcomes. In 2020 Virtual AIChE Annual Meeting. AIChE, 2020.
5. Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, and Vijay Pande. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10):1103–1113, 2017.