(284a) Data-Driven QSAR Modeling for the Iterative Identification of Chemical Motifs from Limited Data | AIChE

Authors 

Song, H. S. - Presenter, University of Nebraska-Lincoln
Park, C. M., Korea Research Institute of Chemical Technology
Jang, S., Institut Pasteur Korea
Kim, J. S., J2H Biotech
Ryu, H. C., J2H Biotech
Seong, H., Korea Research Institute of Chemical Technology
Drug discovery and development is a lengthy and costly process composed of several complex stages, including target validation, high-throughput screening, hit-to-lead, lead optimization, preclinical and clinical testing, and post-clinical evaluation. Advanced data-driven modeling techniques such as artificial intelligence and deep learning (DL) promise to revolutionize the drug development pipeline by significantly reducing time and cost. A key use of DL is to build QSAR (quantitative structure-activity relationship) or QSPR (quantitative structure-property relationship) models, which leverage large datasets to accurately identify the relationships between chemical structures and their activities or properties. Developing predictive DL models is often challenging, however, particularly when the datasets available for training are insufficient, a common problem in the initial stages of research.

In this work, we present a novel workflow to address this limitation. Our strategy is to develop a relatively simple data-driven model from a small dataset and then use this initial model to guide the next round of experiments, which generates new data. The resulting data are fed back to the model for refinement. We continue this iterative process until the predictions of the QSAR or QSPR model are consistent with experimental measurements, increasing the complexity and depth of the model accordingly. Once validated, the model is used to suggest new chemical structures that may outperform all previously tested compounds. We demonstrate our pipeline through a case study of Staphylococcus aureus, a pathogen that causes various life-threatening infections, including skin, lung, heart valve, and bone infections. We represent the structural information of compounds using SMILES (Simplified Molecular Input Line Entry System) strings, define atomic and symbolic features for each entry in a SMILES string, and develop a first version of the model using SINDy (Sparse Identification of Nonlinear Dynamics). Despite the limited data, this initial model shows reasonable accuracy and enables the identification of chemical motifs (i.e., substructures of compounds) that may determine drug activity. Furthermore, the model allows us to evaluate a comprehensive list of compounds in public databases and select top candidates based on their model-estimated activities. The workflow therefore greatly facilitates the otherwise time-consuming steps of identifying drug candidates through iterative interaction between modeling and experimental groups.
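
The modeling step described above (SMILES strings, per-entry features, a sparse SINDy-type fit, then candidate ranking) can be illustrated with a minimal sketch. The abstract does not specify the actual features or solver settings, so everything here is an assumption: token counts stand in for the atomic and symbolic features, a hand-written sequentially thresholded least-squares routine (the optimizer commonly associated with SINDy) stands in for the authors' implementation, and the molecules and activity values are synthetic placeholders generated from a known rule, not experimental data. The names `tokenize_smiles`, `build_vocab`, `featurize`, and `stlsq` are hypothetical.

```python
import re
import numpy as np

# Simplified SMILES tokenizer (assumption): bracket atoms, two-letter halogens,
# single-letter atoms, ring-closure digits, and bond/branch symbols.
# This is not a full SMILES parser.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[=#()/\\+\-@.%:]")

def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)

def build_vocab(smiles_list):
    """Map every token seen in the training set to a column index."""
    tokens = sorted({t for s in smiles_list for t in tokenize_smiles(s)})
    return {tok: j for j, tok in enumerate(tokens)}

def featurize(smiles_list, vocab):
    """Count how often each vocabulary token occurs in each molecule."""
    X = np.zeros((len(smiles_list), len(vocab)))
    for i, s in enumerate(smiles_list):
        for tok in tokenize_smiles(s):
            if tok in vocab:  # tokens unseen during training are ignored
                X[i, vocab[tok]] += 1
    return X

def stlsq(X, y, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares: fit, zero out small
    coefficients, refit on the remaining features, and repeat."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        active = np.abs(coef) >= threshold
        coef[~active] = 0.0
        if not active.any():
            break
        coef[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
    return coef

# Hypothetical small training set (placeholder molecules).
train = ["CCO", "CCCl", "c1ccccc1Cl", "CC(=O)O", "ClCCCl", "c1ccccc1O"]
vocab = build_vocab(train)
X = featurize(train, vocab)

# Synthetic "activity" built from a known rule (one unit per chlorine token)
# so the sparse fit can be checked. These are NOT measured values.
y = 1.0 * X[:, vocab["Cl"]]

coef = stlsq(X, y, threshold=0.2)
selected = {tok: coef[j] for tok, j in vocab.items() if coef[j] != 0.0}
print("tokens kept by the sparse fit:", selected)

# Rank hypothetical candidate compounds by model-estimated activity.
candidates = ["CCCCl", "CCO", "ClC(Cl)Cl"]
scores = featurize(candidates, vocab) @ coef
ranked = [candidates[i] for i in np.argsort(-scores)]
print("candidates, best first:", ranked)
```

In the iterative workflow, the top-ranked candidates would be sent for experimental testing, the new measurements appended to the training set, and the fit repeated. Token counts cannot capture true substructures, so a realistic version would presumably replace them with richer descriptors as data accumulate.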