(375n) Large Language Models for Discovering Equations | AIChE

Authors 

Josephson, T. R., University of Maryland, Baltimore County
Large Language Models (LLMs) are transformer-based machine learning models that have shown remarkable performance on tasks they were not explicitly trained for. We explore the potential of LLMs to perform symbolic regression, that is, to find closed-form, interpretable models from observational data in the physical sciences. Symbolic Regression (SR) is a machine learning technique that searches through a “space of possible equations” to identify those that balance accuracy and simplicity for a given dataset. In this work, we designed an iterative workflow that uses GPT-4 as a symbolic regressor. We prompt GPT-4 to suggest expressions from data; these expressions are evaluated for complexity and loss, and the results are sent back to GPT-4 so that it can suggest better expressions, optimizing for both criteria. We show how strategic prompting improves GPT-4's performance, and our observations indicate that the model can identify target model expressions when they are concise and contain basic math operators. Although GPT-4 does not outperform established SR programs where target model equations are complex with low loss, LLMs require zero training and minimal programming knowledge, removing barriers in interdisciplinary research. Additionally, working with natural language makes it straightforward to integrate background knowledge with data in SR.
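
The propose, evaluate, and feed-back loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `mock_propose` is a hypothetical stand-in for the GPT-4 prompt and response step, the complexity measure (a count of syntax-tree nodes) and the loss (mean squared error) are assumed choices, and ranking by loss first and complexity second is a simplification of optimizing for both.

```python
import ast
import math

# Functions an expression may call; an assumed, deliberately small vocabulary.
SAFE_FUNCS = {"exp": math.exp, "log": math.log, "sqrt": math.sqrt}


def complexity(expr: str) -> int:
    """Crude complexity proxy: count operators, calls, names, and constants."""
    counted = (ast.BinOp, ast.UnaryOp, ast.Call, ast.Name, ast.Constant)
    return sum(isinstance(n, counted) for n in ast.walk(ast.parse(expr, mode="eval")))


def mse(expr: str, xs, ys) -> float:
    """Mean squared error of expr(x) against ys; infinite if evaluation fails."""
    env = {"__builtins__": {}, **SAFE_FUNCS}
    try:
        code = compile(expr, "<expr>", "eval")
        errors = [(eval(code, env, {"x": x}) - y) ** 2 for x, y in zip(xs, ys)]
        return sum(errors) / len(errors)
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError,
            NameError, TypeError):
        return math.inf


def run_loop(propose, xs, ys, iterations=3):
    """Ask for candidates, score them, and pass the scored history back."""
    history = []  # entries are (expression, loss, complexity)
    for _ in range(iterations):
        for expr in propose(history):
            if any(expr == seen for seen, _, _ in history):
                continue  # skip expressions that were already scored
            loss = mse(expr, xs, ys)
            if math.isinf(loss):
                continue  # skip expressions that could not be evaluated
            history.append((expr, loss, complexity(expr)))
        # Rank by loss, then complexity (a simplification of a joint objective).
        history.sort(key=lambda entry: (entry[1], entry[2]))
    return history


def mock_propose(history):
    """Hypothetical stand-in for the LLM: returns two fixed candidates per call."""
    candidates = ["x", "x + 1", "2 * x", "2 * x + 1", "x ** 2", "exp(x)"]
    start = len(history)
    return candidates[start:start + 2]


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
history = run_loop(mock_propose, xs, ys, iterations=3)
best_expr, best_loss, best_complexity = history[0]
print(best_expr, best_loss, best_complexity)
```

In a real workflow, `mock_propose` would be replaced by a function that formats the scored history into a prompt and parses the model's reply into expression strings. Evaluating model-generated strings with `eval` is only acceptable in a sandboxed setting; the restricted namespace here is a convenience, not a security boundary.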