
(12c) Creating a Universal Chemical Language: Leveraging Transformers to Enhance Molecular Descriptors for Quantitative Structure-Property Relationships

Authors 

Pahari, S. - Presenter, Texas A&M University
Kwon, J. - Presenter, Texas A&M University

Over the last decade, machine learning (ML) has become a focal point of cheminformatics research, particularly for developing quantitative structure-property relationships (QSPR). The standard QSPR workflow requires a set of effective molecular descriptors that sufficiently characterize the input molecules for the model. Earlier studies used descriptors such as Randić molecular profiles, geometrical descriptors, radial distribution functions, topological indices, and a variety of two- and three-dimensional structural features as model inputs. However, feeding many correlated structural features into a model leads to poor performance and a heightened risk of overfitting. To address this challenge, feature selection procedures such as correlation-based filtering, principal component analysis (PCA), and genetic algorithms (GA) have been applied [1]. A limitation of this conventional feature-screening procedure is that it does not yield a generalized set of molecular descriptors for QSPR models; moreover, model accuracy depends strongly on the initial pool of features considered before the screening step.
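
As a concrete illustration of this conventional screening step, the minimal sketch below applies a correlation filter followed by PCA to a descriptor matrix; the data, threshold, and dimensions are illustrative assumptions, not settings taken from the studies cited above.

```python
# Illustrative feature-screening sketch: correlation filtering, then PCA.
# The descriptor matrix X and the 0.95 threshold are hypothetical placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def correlation_filter(X, threshold=0.95):
    """Keep only descriptors that are not highly correlated with an already-kept descriptor."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

# X: (n_molecules, n_descriptors) array of precomputed descriptors (random stand-in data)
X = np.random.rand(100, 50)
X_filtered, kept_cols = correlation_filter(X, threshold=0.95)

# Project the surviving descriptors onto a lower-dimensional space with PCA
X_scaled = StandardScaler().fit_transform(X_filtered)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
```

The screened matrix X_reduced would then serve as the input to a QSPR regression model, which is exactly where the dependence on the initial descriptor pool noted above enters.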

To address the challenges associated with featurizing the input space of QSPR models, recent cheminformatics studies have adopted molecular fingerprints. Molecular fingerprints are numerical representations of molecules that are intended to encode key structural and chemical features. A wide array of handcrafted fingerprinting methods has been developed on the basis of chemical experience and intuition. For example, one of the most widely used fingerprints in the literature is the Morgan fingerprint, also known as the extended-connectivity fingerprint (ECFP4) [2]. Although such fingerprints are well validated for small molecules, they lack generalizability and hinder efficient model training. To address these limitations, machine-learned fingerprints have been adopted by leveraging modern machine-learning architectures such as autoencoders. Specifically, an autoencoder is trained on an unlabeled dataset of molecular structures to encode molecules into a latent space [3]; these latent-space representations are then used for property prediction. A primary benefit of such machine-learned fingerprints is that they support the automated deployment of machine-learning pipelines, for example on cloud computing platforms. However, the neural-network architectures used to build autoencoders (i.e., deep neural networks or graph neural networks) still limit their generalizability, and the scalability of these models remains significantly constrained by the adopted architectures. To sidestep the drawbacks of autoencoder-derived machine-learned fingerprints, the field has begun exploring transformers. Known for their prowess in natural language processing (NLP), where they capture semantic relationships via contextual analysis [4], transformers have been leveraged to develop QSPR models with labeled datasets. Despite their potential, only a handful of studies have investigated their use in generating machine-crafted fingerprints, pointing to an emerging area of research with significant promise [5].
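
For reference, a handcrafted Morgan/ECFP4 bit vector of the kind described above can be computed with RDKit, as in the minimal sketch below; the molecule, radius, and bit length are placeholder choices rather than values prescribed by this work.

```python
# Minimal handcrafted-fingerprint sketch: Morgan fingerprint via RDKit.
# radius=2 corresponds to the ECFP4-equivalent encoding; "CCO" (ethanol) is a placeholder molecule.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# Convert the bit vector into a dense 0/1 array usable as a QSPR model input
arr = np.zeros((2048,), dtype=int)
DataStructs.ConvertToNumpyArray(fp, arr)
```

Because each bit is set by hashing local atomic environments, such vectors are fixed by construction, which is precisely the rigidity that motivates the machine-learned alternatives discussed above.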

To bridge this knowledge gap, we have leveraged the multi-headed attention mechanism and the excellent transfer-learning capability of transformers to derive the most general set of machine-learned fingerprints, which we term the ‘universal chemical language’. Developing the ‘universal chemical language’ involves three key components: (i) representing the molecular structure as a SMILES string and tokenizing it, (ii) applying absolute positional encoding to capture the relative positions of atoms among the tokens, and (iii) passing the tokens through encoder blocks with a self-attention mechanism to achieve contextual understanding among the tokenized atoms. Mathematically, the ‘universal chemical language’ is represented as an embedding obtained from the encoder blocks. The encoder-transformer model is trained with the masked language modeling method on an unlabeled dataset of molecular structures containing 1.6 billion small molecules. The embedding obtained from the transformer is fed into downstream regression models, which are fine-tuned with domain-specific labeled datasets to predict specific properties. To highlight the generality of these embeddings, they have been used to predict multiple physical and quantum mechanical properties from the QM9 dataset. Their application has been further extended to relating molecular structure to the crystal size distribution of crystallizing molecules, relating the molecular structure of solvents to their CO2 absorption capacity at varying temperatures and pressures, and predicting protein-ligand interactions. The development of these generalized machine-crafted fingerprints opens avenues for accelerated molecule discovery and structure screening.
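
The sketch below illustrates, in broad strokes, the pipeline described above: character-level SMILES tokenization, learned absolute positional encodings, Transformer encoder blocks with self-attention, and mean pooling of the contextual token embeddings into a fixed-length molecular embedding that feeds a downstream regression head. The vocabulary, dimensions, pooling choice, and regression head are illustrative assumptions and do not reproduce the authors' exact architecture or training setup.

```python
# Hedged sketch of a SMILES encoder-transformer with an MLM head and a pooled embedding.
import torch
import torch.nn as nn

VOCAB = ["<pad>", "<mask>"] + list("CNOSPFIclBrn()[]=#+-1234567890@H")
STOI = {t: i for i, t in enumerate(VOCAB)}

def tokenize(smiles, max_len=64):
    """Map a SMILES string to a fixed-length tensor of token ids (character-level, simplified)."""
    ids = [STOI.get(ch, STOI["<pad>"]) for ch in smiles[:max_len]]
    ids += [STOI["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids)

class SmilesEncoder(nn.Module):
    def __init__(self, vocab_size=len(VOCAB), d_model=128, nhead=4, num_layers=4, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)        # absolute positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)        # used during masked-language-model pretraining

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device).unsqueeze(0)
        return self.encoder(self.tok_emb(ids) + self.pos_emb(pos))  # per-token contextual embeddings

    def fingerprint(self, ids):
        """Mean-pool token embeddings into a fixed-length molecular embedding."""
        return self.forward(ids).mean(dim=1)

# Downstream use: embed molecules and regress a property with a hypothetical linear head
model = SmilesEncoder()
batch = torch.stack([tokenize("CCO"), tokenize("c1ccccc1O")])   # placeholder molecules
emb = model.fingerprint(batch)                                   # shape: (batch, d_model)
regressor = nn.Linear(emb.size(-1), 1)                           # illustrative property head
prediction = regressor(emb)
```

In pretraining, the mlm_head would be optimized with masked language modeling on unlabeled SMILES; for downstream property prediction, the pooled embedding is fine-tuned jointly with a regression head on the domain-specific labeled data, as described above.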

References

[1] Goodarzi, M., Dejaegher, B., & Heyden, Y. V. (2012). Feature selection methods in QSAR studies. Journal of AOAC International, 95(3), 636-651.

[2] Capecchi, A., Probst, D., & Reymond, J. L. (2020). One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. Journal of Cheminformatics, 12, 1-15.

[3] Das, K., Samanta, B., Goyal, P., Lee, S. C., Bhattacharjee, S., & Ganguly, N. (2022). CrysXPP: An explainable property predictor for crystalline materials. npj Computational Materials, 8(1), 43.

[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[5] Cao, Z., Magar, R., Wang, Y., & Barati Farimani, A. (2023). Moformer: self-supervised transformer model for metal–organic framework property prediction. Journal of the American Chemical Society, 145(5), 2958-2967.