2024 AIChE Annual Meeting

(632a) Crystalformer: Integrating Transformer-Based Language Models in a Hybrid Approach to Relate Molecular Structure with Crystal Size Distribution

Authors

Pahari, S. - Presenter, Texas A&M University
Kwon, J. - Presenter, Texas A&M University
Crystallization plays a pivotal role in both chemistry and pharmaceuticals. It is estimated that between 70% and 80% of active pharmaceutical ingredients (APIs) have at least one crystallization step in their manufacturing process. Consequently, key performance indicators (KPIs) of formulated drug products, such as the effectiveness of drug release, bioavailability, and content uniformity, are significantly affected by the crystal size distribution (CSD) of the product obtained from a crystallizer [1]. Therefore, understanding the relationship between the chemical structure of crystallizing molecules and their CSD is crucial in process development, and it can aid in screening API molecules that are likely to give a higher yield and the desired product attributes. However, given the vast diversity of molecules capable of crystallization, developing a generalized model that captures this relationship is a challenge.

The emergence of cheminformatics has been instrumental in addressing this challenge, particularly through advancements in molecular modeling and structure-property relationship studies. Some of these efforts have been successful in determining the structure of stable polymorphs and have provided significant insights into the crystallization process. Although such insights are extremely valuable in analyzing the CSD, the computational cost of these molecular models makes their application to the virtual screening of APIs infeasible [2]. To overcome these limitations, the field has leaned towards quantitative structure-property relationships (QSPR), which can readily leverage recent advances in machine learning architectures and available data to predict crucial properties associated with crystals. For example, recent studies have explored a multitude of machine learning models to classify molecules based on their propensity to crystallize and subsequently help make “Go/No-Go” decisions for new product development [3, 4]. However, most of these studies are limited to a specific class of crystallizing molecules and, in most cases, do not relate the molecular structure of the crystallizing molecules to their CSDs.

To address this knowledge gap, we have developed a hybrid model, which combines the strength of machine learning with the domain knowledge of a first-principles model, to relate the molecular structure of crystallizing species to their CSDs. The key feature that makes this model generalizable to all crystallizing molecules is the use of machine-learnt molecular fingerprints [5]. Molecular fingerprints are numerical representations of molecules that form the features of QSPR models. Since existing QSPR models rely on handcrafted fingerprints derived from intuition, they become specific to a class of molecules. In this work, we have developed a large language model with the transformer-encoder architecture that predicts these fingerprints. Specifically, the encoder is trained on an unlabeled dataset of 1.6 billion small molecules using the masked language modeling method, where the input to the encoder is the molecular structure represented as a SMILES string. The encoder is equipped with the self-attention mechanism, which allows it to learn the molecular fingerprints in the form of a mathematical embedding, which we term the ‘universal chemical language’. Since the encoder is trained on a large corpus of data, it can be used to generate the fingerprints (i.e., the embedding) for any new molecule. The embedding from the encoder is subsequently used in downstream regression models, such as neural networks, to predict thermodynamic properties such as the solubility of the molecules. Furthermore, a probabilistic regression model uses the embedding as features and predicts the distribution of kinetic properties such as the rate constant of crystal growth. Finally, the predicted kinetic and thermodynamic properties are used in a first-principles population balance equation to predict the CSD of the molecules.
Overall, the framework uses a machine learning component to predict molecule-specific properties, which are then used within a first-principles model to predict the CSD.
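
As an illustration of how predicted properties can feed a first-principles model, the sketch below solves a population balance by the method of moments. It is a minimal example under stated assumptions, not the formulation used in this work: it assumes a seeded batch crystallizer with no nucleation, size-independent growth, and a power-law growth rate G = k_g (C/C_sat - 1)^g. The function name, the parameter values, and the units are hypothetical; `k_g` and `c_sat` stand in for the growth rate constant and solubility that the machine learning component would supply.

```python
def simulate_seeded_growth(k_g, c_sat, c0, moments0, g=1.0,
                           rho=1300.0, k_v=0.52, t_end=3600.0, dt=1.0):
    """Growth-only population balance solved by the method of moments.

    Assumed model (illustrative, not taken from the abstract):
        d(mu_0)/dt = 0                      (no nucleation)
        d(mu_j)/dt = j * G * mu_{j-1}       for j = 1, 2, 3
        dC/dt      = -3 * rho * k_v * G * mu_2
        G          = k_g * (C / c_sat - 1) ** g

    k_g      : growth rate constant [m/s]     (stand-in for an ML prediction)
    c_sat    : solubility [kg/m^3]            (stand-in for an ML prediction)
    c0       : initial solute concentration [kg/m^3]
    moments0 : initial moments [mu_0, mu_1, mu_2, mu_3] of the seed CSD
    rho, k_v : crystal density [kg/m^3] and volume shape factor (assumed)
    """
    mu = list(moments0)
    c = c0
    for _ in range(int(t_end / dt)):
        # Relative supersaturation; clipped so crystals never dissolve here.
        s = max(c / c_sat - 1.0, 0.0)
        growth = k_g * s ** g
        # Right-hand sides of the moment equations.
        d_mu = [0.0] + [j * growth * mu[j - 1] for j in (1, 2, 3)]
        d_c = -3.0 * rho * k_v * growth * mu[2]
        # Explicit Euler step.
        mu = [m + dt * d for m, d in zip(mu, d_mu)]
        c = c + dt * d_c
    return mu, c


# Hypothetical inputs: monodisperse 50-micron seeds, 1e9 crystals per m^3.
seed_size, seed_count = 50e-6, 1e9
seed_moments = [seed_count * seed_size ** j for j in range(4)]
moments, c_final = simulate_seeded_growth(
    k_g=1e-7, c_sat=15.0, c0=20.0, moments0=seed_moments)
mean_size = moments[1] / moments[0]  # number-mean crystal size [m]
```

The moments summarize the CSD (for example, `moments[1] / moments[0]` is the number-mean size). Recovering the full distribution would require solving the population balance equation directly, for instance with a discretized size grid.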

A case study on paracetamol demonstrates the framework's effectiveness: the predicted solubilities and kinetic properties agree well with experimental data, leading to an accurate CSD prediction. This hybrid model represents a significant stride towards a generalized, efficient approach to understanding and predicting the crystallization properties of molecules, paving the way for more streamlined crystallization processes.

References

[1] Meng, W., Sirota, E., Feng, H., McMullen, J. P., Codan, L., & Cote, A. S. (2020). Effective control of crystal size via an integrated crystallization, wet milling, and annealing recirculation system. Organic Process Research & Development, 24(11), 2639-2650.

[2] Lutsko, J. F. (2019). How crystals form: A theory of nucleation pathways. Science Advances, 5(4), eaav7399.

[3] Pereira, F. (2020). Machine learning methods to predict the crystallization propensity of small organic molecules. CrystEngComm, 22(16), 2817-2826.

[4] Zhang, Y., Bertani, M., Pedone, A., Youngman, R. E., Tricot, G., Kumar, A., & Goel, A. (2024). Decoding crystallization behavior of aluminoborosilicate glasses: From structural descriptors to Quantitative Structure–Property Relationship (QSPR) based predictive models. Acta Materialia, 119784.

[5] Kuenneth, C., & Ramprasad, R. (2023). polyBERT: a chemical language model to enable fully machine-driven ultrafast polymer informatics. Nature Communications, 14(1), 4099.