(592d) Impact: The Industrial Microbiology Publication and AI Crowdsourced Toolbox | AIChE

Authors 

Moon, H., Clayton High School, 1 Mark Twain Cir
Czajka, J., Washington University in St. Louis
Chen, Y., Washington University in St. Louis
Tang, Y., Washington University in St. Louis
Abstract.

As Machine Learning (ML) and Artificial Intelligence (AI) technologies continue to make significant strides across various scientific fields, they are increasingly becoming integral to the metabolic engineering and synthetic biology sectors. However, the advancement of ML in these domains is frequently hampered by critical challenges, such as the scarcity of comprehensive datasets and the redundancy of features necessary for methodological innovation. Furthermore, the process of developing bioprocess-related ML models is notably inefficient and prone to duplication, largely due to the absence of established collaborative frameworks.

To make published biomanufacturing knowledge easier to access and extract, we introduce the Industrial Microbiology Publication and AI Crowdsourced Toolbox (ImpactDB), a website that hosts both modeling tools and published biomanufacturing information. To date, ImpactDB lists approximately 1900 published studies, giving users access to publication details such as titles, authors, and abstracts. Beyond aggregating studies, ImpactDB has extracted over 3000 detailed bioproduction results from 192 papers (mainly for cyanobacteria and Yarrowia yeast), a number that expands to 4600 through the use of large language models (LLMs) such as GPT-4 for automated information retrieval. However, human supervision remains crucial for maintaining data integrity, because LLM-based extraction can lose data and its accuracy fluctuates. Further limitations arise from graphical data and token limits, which call for advanced multimodal language models.

With LLM assistance, the knowledge extracted from each paper is organized into three types of features: genetic engineering features (e.g., strains, pathway engineering steps, heterologous enzymes, promoter strengths, and number of gene knockouts), bioreactor and cultivation features (e.g., reactor size and modes, mixing, oxygen, medium, and carbon substrate type), and metabolic features (e.g., metabolic fluxes, reaction Gibbs free energy, and product formation energy). This structured approach supports a comprehensive exploration of the metabolic engineering landscape, from strains and pathway engineering steps to bioreactor configurations and metabolic fluxes.

To facilitate data extraction and organization, ImpactDB implements a molecular inventory along with a pathway map. The inventory incorporates common features, including chemical and physical properties of substrates and products (e.g., solubility, molecular weight, and elemental composition), as well as biological information (e.g., enzymatic steps, cofactor costs, and ATP costs). More importantly, ImpactDB offers a crowdsourcing function that allows researchers in the scientific community to upload relevant datasets and contribute to the growth of the database. Ultimately, ImpactDB will serve as a rich resource for the community, fostering collaboration and standardization among researchers, leveraging curated knowledge to offer biomanufacturing predictions, and expediting bioprocess development and cost estimation.
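As an illustration of the three feature types described above, the following is a minimal sketch of how one extracted bioproduction result might be represented. It is not ImpactDB's actual schema: every class name, field name, unit, and the `needs_review` helper is hypothetical, and the example values are placeholders rather than data from any paper.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class GeneticFeatures:
    # Hypothetical fields mirroring the genetic engineering features in the abstract.
    strain: str
    pathway_engineering_steps: int = 0
    heterologous_enzymes: List[str] = field(default_factory=list)
    promoter_strength: Optional[str] = None
    gene_knockouts: int = 0


@dataclass
class CultivationFeatures:
    # Hypothetical fields mirroring the bioreactor and cultivation features.
    reactor_volume_l: Optional[float] = None
    reactor_mode: Optional[str] = None  # e.g. "batch" or "fed-batch"
    medium: Optional[str] = None
    carbon_substrate: Optional[str] = None


@dataclass
class MetabolicFeatures:
    # Hypothetical fields mirroring the metabolic features.
    reaction_gibbs_energy_kj_mol: Optional[float] = None
    product_formation_energy_kj_mol: Optional[float] = None


@dataclass
class BioproductionRecord:
    paper_title: str
    product: str
    genetic: GeneticFeatures
    cultivation: CultivationFeatures
    metabolic: MetabolicFeatures
    extracted_by: str = "human"  # assumed labels: "human" or "llm"
    human_reviewed: bool = False


def needs_review(record: BioproductionRecord) -> bool:
    """Flag LLM-extracted records that no human has checked yet."""
    return record.extracted_by == "llm" and not record.human_reviewed


# Placeholder record; none of these values come from a real study.
example = BioproductionRecord(
    paper_title="<paper title>",
    product="<product name>",
    genetic=GeneticFeatures(strain="<strain>", gene_knockouts=2),
    cultivation=CultivationFeatures(reactor_mode="fed-batch"),
    metabolic=MetabolicFeatures(),
    extracted_by="llm",
)

flat = asdict(example)  # nested dict, convenient for JSON export or tabular ML input
```

Keeping the three feature groups as separate nested objects follows the organization the abstract describes, and the review flag reflects its point that LLM-extracted entries still need human supervision.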

Availability and Implementation.

IMPACT is freely available at https://impact-database.com. The code is open-source and can be accessed at https://github.com/impact-db/impact-db-client.