(12c) Creating a Universal Chemical Language: Leveraging Transformers to Enhance Molecular Descriptors for Quantitative Structure-Property Relationships
2024 AIChE Annual Meeting
Computing and Systems Technology Division
10: CAST Director's Student Presentation Award Finalists (Invited Talks)
Sunday, October 27, 2024 - 4:06pm to 4:24pm
To address the challenges associated with featurizing the input space for quantitative structure-property relationship (QSPR) models, recent studies in cheminformatics have adopted molecular fingerprints: numerical representations of molecules designed to encode key structural and chemical features. A wide array of handcrafted fingerprinting methods has been developed based on chemical experience and intuition. For example, one of the most renowned fingerprints in the literature is the Morgan fingerprint, also known as the extended-connectivity fingerprint (ECFP4) [2]. Although such fingerprints are well validated for small molecules, they lack generalizability and hinder efficient model training. To address these limitations, machine-learned fingerprints have been developed by leveraging modern machine-learning architectures such as autoencoders. Specifically, an autoencoder is trained on an unlabeled dataset of molecular structures to encode molecules into a latent space [3]. These latent-space representations are then used for property prediction. The primary benefit of such machine-learned fingerprints is the ease of automated deployment within machine-learning pipelines, aided by cloud computing. However, the neural-network architectures (e.g., deep neural networks or graph neural networks) used to build autoencoders still limit their generalizability, and the scalability of these models remains constrained by the limitations of the adopted architectures. To sidestep the drawbacks of autoencoder-derived fingerprints, the field has begun exploring transformers. Known for their prowess in natural language processing (NLP), where they capture semantic relationships via contextual analysis [4], transformers have been leveraged to develop QSPR models from labeled datasets. Despite their potential, only a handful of studies have investigated their use in generating machine-crafted fingerprints, pointing to an emerging area of research with significant promise [5].
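As a concrete illustration of the handcrafted approach, the snippet below computes an ECFP4-style Morgan fingerprint with RDKit. This is a minimal sketch, not the settings used in this work: the radius of 2 (a diameter of 4, the "4" in ECFP4), the 2048-bit vector length, and the example molecule are common but illustrative choices.

# Minimal sketch: computing a handcrafted Morgan/ECFP4 fingerprint with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CCO"  # ethanol, used here only as an example input
mol = Chem.MolFromSmiles(smiles)

# radius=2 corresponds to ECFP4; 2048 bits is a common (illustrative) length.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

features = list(fp)  # 0/1 feature vector usable as input to a QSPR model
print(sum(features), "bits set out of", len(features))

The resulting bit vector can be fed directly to any downstream QSPR regressor, which is exactly the role the machine-learned fingerprints discussed next are meant to fill more generally.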
To bridge this knowledge gap, we have leveraged the multiheaded attention mechanism and the excellent transfer-learning capability of transformers to derive the most general set of machine-learned fingerprints, which we term the "universal chemical language". Developing this language involves three components: representing the molecular structure as a SMILES string and tokenizing it; applying absolute positional encoding to capture the relative positions of atoms among the tokens; and passing the tokens through encoder blocks with a self-attention mechanism to achieve contextual understanding among the tokenized atoms. Mathematically, the "universal chemical language" is the embedding obtained from the encoder blocks. The encoder-only transformer is trained via masked language modeling on an unlabeled dataset of 1.6 billion small-molecule structures. The resulting embedding feeds downstream regression models, which are fine-tuned with domain-specific labeled datasets to predict specific properties. To highlight the generality of these embeddings, they have been used to predict multiple physical properties as well as quantum mechanical properties from the QM9 dataset. Their application has been extended to relating molecular structure to the crystal size distribution of crystallizing molecules, relating the molecular structure of solvents to their CO2 absorption capacity at varying temperature and pressure, and modeling protein-ligand interactions. The development of these generalized machine-crafted fingerprints opens avenues for accelerated molecule discovery and structure screening.
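To make the pipeline concrete, the following is a minimal, self-contained PyTorch sketch of the three components described above, not the authors' implementation: a naive character-level SMILES tokenizer, learned absolute positional embeddings (a sinusoidal variant would serve equally), and a stack of self-attention encoder blocks whose pooled output serves as the molecule-level embedding, with a masked-language-modeling head for pretraining and a regression head for fine-tuning. The vocabulary, model dimensions, and mean pooling are assumptions for illustration.

# Illustrative sketch of the SMILES-encoder pipeline (assumed details throughout).
import torch
import torch.nn as nn

# Naive character vocabulary; an assumption, not the vocabulary used in this work.
VOCAB = ["<pad>", "<mask>"] + list("CNOSPFIclnos=#()[]@+-/\\123456789")
TOK2ID = {t: i for i, t in enumerate(VOCAB)}

def tokenize(smiles: str, max_len: int = 64) -> torch.Tensor:
    # Character-level tokenization, padded/truncated to a fixed length.
    ids = [TOK2ID.get(ch, TOK2ID["<pad>"]) for ch in smiles][:max_len]
    ids += [TOK2ID["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids)

class SmilesEncoder(nn.Module):
    def __init__(self, vocab_size=len(VOCAB), d_model=128, n_heads=8,
                 n_layers=4, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned absolute positional embedding injects token order.
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # MLM head, used during self-supervised pretraining on unlabeled SMILES.
        self.mlm_head = nn.Linear(d_model, vocab_size)
        # Regression head, fine-tuned on a domain-specific labeled dataset.
        self.reg_head = nn.Linear(d_model, 1)

    def embed(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.tok_emb(token_ids) + self.pos_emb(pos)
        return self.encoder(h)  # contextualized per-token embeddings

    def forward(self, token_ids):
        h = self.embed(token_ids)
        fingerprint = h.mean(dim=1)  # pooled molecule-level embedding
        return self.mlm_head(h), self.reg_head(fingerprint)

model = SmilesEncoder()
tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O").unsqueeze(0)  # aspirin, as an example
mlm_logits, property_pred = model(tokens)
print(mlm_logits.shape, property_pred.shape)

In practice, SMILES tokenizers are typically regex-based so that multi-character atoms such as Cl and Br remain single tokens, and the molecule-level embedding could equally be read from a dedicated classification token rather than mean pooling.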
References.
[1] Goodarzi, M., Dejaegher, B., & Heyden, Y. V. (2012). Feature selection methods in QSAR studies. Journal of AOAC International, 95(3), 636-651.
[2] Capecchi, A., Probst, D., & Reymond, J. L. (2020). One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. Journal of Cheminformatics, 12, 1-15.
[3] Das, K., Samanta, B., Goyal, P., Lee, S. C., Bhattacharjee, S., & Ganguly, N. (2022). CrysXPP: An explainable property predictor for crystalline materials. npj Computational Materials, 8(1), 43.
[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[5] Cao, Z., Magar, R., Wang, Y., & Barati Farimani, A. (2023). MOFormer: self-supervised transformer model for metal–organic framework property prediction. Journal of the American Chemical Society, 145(5), 2958-2967.