(284a) Data-Driven QSAR Modeling for the Iterative Identification of Chemical Motifs from Limited Data
AIChE Annual Meeting
2022
Topical Conference: Chemical Engineers in Medicine
Big Data and Machine Learning to Advance Medicine
Tuesday, November 15, 2022 - 8:00am to 8:19am
In this work, we present a novel workflow to address this limitation. Our strategy is to develop a relatively simple data-driven model from a small dataset and then use this initial model to guide the next round of experiments, which generate new data. The resulting data are fed back to the model for refinement. We continue this iterative process until the predictions of the QSAR or QSPR models are consistent with experimental measurements, increasing the complexity and depth of the model along the way. Once validated, the model is used to suggest new chemical structures that may outperform all previously tested compounds. We demonstrate our pipeline through a case study on Staphylococcus aureus, a pathogen that causes various life-threatening infections, including skin, lung, heart valve, and bone infections. We represent the structural information of compounds using SMILES (Simplified Molecular Input Line Entry System) codes, define atomic and symbolic features for each entry in the SMILES string, and develop a first version of the model using SINDy (Sparse Identification of Nonlinear Dynamics). Despite the data deficiency, this initial model shows reasonable accuracy and enables the identification of chemical motifs (i.e., substructures of compounds) that may determine drug activity. Furthermore, the model allows us to evaluate a comprehensive list of compounds in public databases and select top candidates based on the model-estimated activities. The workflow developed in this work therefore greatly facilitates the otherwise time-consuming steps of identifying drug candidates through iterative interaction between modeling and experimental groups.
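The modeling steps described above (SMILES encoding, per-entry features, sparse regression, candidate ranking) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the features are simple token counts and that the sparse fit uses sequentially thresholded least squares, the optimizer commonly associated with SINDy. The tokenizer, the vocabulary, the function names, and all molecules and activity values are hypothetical and synthetic, chosen only so that the example runs end to end.

```python
import re
import numpy as np

# Hypothetical tokenizer (an assumption, not a full SMILES parser):
# two-letter halogens and bracket atoms first, then any single character.
_TOKEN_RE = re.compile(r"Cl|Br|\[[^\]]+\]|.")

def tokenize(smiles):
    """Split a SMILES string into atom and symbol tokens."""
    return _TOKEN_RE.findall(smiles)

def featurize(smiles_list, vocab):
    """Count how often each vocabulary token occurs in each SMILES string."""
    index = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(smiles_list), len(vocab)))
    for i, smiles in enumerate(smiles_list):
        for tok in tokenize(smiles):
            if tok in index:
                X[i, index[tok]] += 1.0
    return X

def stlsq(X, y, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        active = ~small
        if not active.any():
            break
        coef[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
    return coef

# Synthetic toy data (NOT experimental): the "activity" is constructed as
# 1.5 * (number of Cl tokens) - 0.5 * (number of O tokens).
vocab = ["C", "c", "O", "N", "Cl", "Br"]
train = ["CCO", "CCCl", "c1ccccc1Cl", "CCN", "CC(=O)O", "ClCCCl", "CCBr", "c1ccccc1"]
X = featurize(train, vocab)
y = 1.5 * X[:, vocab.index("Cl")] - 0.5 * X[:, vocab.index("O")]

coef = stlsq(X, y)
# Tokens with nonzero coefficients stand in for the "motifs" that the model flags.
motifs = {tok: c for tok, c in zip(vocab, coef) if c != 0.0}
print("flagged tokens:", motifs)

# Screen a hypothetical candidate list and rank it by model-estimated activity.
candidates = ["CCCCl", "CCO", "ClCCl"]
scores = featurize(candidates, vocab) @ coef
ranked = [candidates[i] for i in np.argsort(-scores)]
print("ranked candidates:", ranked)
```

In the iterative setting described above, the top-ranked candidates would be tested experimentally, the new measurements appended to the training set, and the fit repeated, possibly with a richer feature set, until predictions and measurements agree.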