(459c) Experiences in Teaching Statistics and Data Science to Chemical Engineering Students at the University of Wisconsin-Madison
AIChE Annual Meeting
2024
2024 AIChE Annual Meeting
Computing and Systems Technology Division
Teaching Data Science to Students and Teachers
Wednesday, October 30, 2024 - 4:00pm to 4:15pm
Statistics is one of the pillars of modern science and engineering and of emerging topics such as data science and machine learning; despite this, its scope and relevance have remained stubbornly misunderstood and underappreciated in chemical engineering education (and in engineering education at large). Statistics is often taught with an emphasis on data analysis. However, statistics is much more than that; it is a mathematical modeling paradigm that complements the physical modeling paradigms used in chemical engineering (e.g., thermodynamics, transport phenomena, conservation laws, reaction kinetics). Specifically, statistics can help model random phenomena that might not be predictable from physics alone (or from deterministic physical laws), can help quantify the uncertainty of predictions obtained with physical models, can help discover physical models from data, and can help create models directly from data (in the absence of physical knowledge).
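As a concrete illustration of the second and third of these roles, the following minimal Python sketch (synthetic data and hypothetical numbers, purely illustrative and not taken from the course) fits a first-order decay model to noisy concentration measurements and quantifies the uncertainty of the estimated rate constant.

# Minimal sketch (synthetic data): using statistics to quantify the uncertainty
# of a parameter in a simple physical model, C = C0*exp(-k*t).
import numpy as np

rng = np.random.default_rng(42)

# "True" physics, used only to generate synthetic measurements.
k_true, C0 = 0.3, 1.0
t = np.linspace(0.5, 10.0, 20)
C = C0 * np.exp(-k_true * t) * np.exp(0.05 * rng.normal(size=t.size))  # multiplicative noise

# Linearize: ln(C) = ln(C0) - k*t, then estimate k by least squares.
y = np.log(C)
A = np.column_stack([np.ones_like(t), -t])
coef, res, *_ = np.linalg.lstsq(A, y, rcond=None)
k_hat = coef[1]

# Standard error of k from the residual variance and the normal equations.
dof = t.size - 2
sigma2 = res[0] / dof
cov = sigma2 * np.linalg.inv(A.T @ A)
se_k = np.sqrt(cov[1, 1])

print(f"k estimate = {k_hat:.3f} +/- {1.96 * se_k:.3f} (approx. 95% interval)")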
The desire to design a course on statistics for chemical engineering came about from my personal experience learning statistics in college and from identifying significant gaps in my understanding of statistics throughout my professional career. Similar feelings are often shared with me by professionals working in industry and academia. Like many chemical engineers, I took a course in statistics in college that covered classical topics such as random variables, descriptive statistics, regression, design of experiments, and basic probability. While I found this course interesting, it felt disconnected from the rest of the chemical engineering curriculum; with the exception of linear regression and design of experiments, I did not encounter other major uses of statistics in the curriculum. This left me with the perception that statistics was an intellectual curiosity. I believe similar situations arise when students take machine learning or data science courses that have no connection to physics or chemical engineering concepts. Throughout my professional career, I have been exposed to a broad range of applications in which knowledge of statistics has proven essential: uncertainty quantification, quality control, risk assessment, modeling of random phenomena, process monitoring, forecasting, machine learning, computer vision, and decision-making under uncertainty. These applications are pervasive in industry and academia. It is also important to recognize that the field of statistics has evolved and that new concepts and tools have become available; for example, the fields of uncertainty quantification, Bayesian analysis, statistical learning, and decision-making under uncertainty have experienced significant growth in recent years.
I believe that there is a need to modernize the scope of and approach to teaching statistics. This should be done carefully, by finding a suitable blend of classical and new topics, by finding points of connection with the rest of the chemical engineering curriculum, and by thinking about the unique skills and interests of chemical engineers. Statistics is typically taught by mathematics/statistics departments to a broad range of engineers; as such, the material tends to be standardized, and it can be difficult for instructors to make explicit connections to applications that are of interest to chemical engineers. Specifically, it is important to understand that chemical engineers are trained to use physical knowledge (e.g., thermodynamics, reaction kinetics, transport phenomena) to analyze systems and to make decisions (e.g., design an experiment or a chemical process). As such, when teaching statistics, it is important to emphasize when and how these tools can complement (or substitute for) physics.
Moreover, it is important to remember that a key skill of chemical engineers is the ability to develop mathematical abstractions (models) to analyze complex systems; by "complex" I mean systems that involve heterogeneous phenomena (e.g., reaction, flows, heating/cooling, separation) and that might involve different scales (e.g., molecular level, unit level, process level, enterprise level). As such, when teaching statistics, it is important to emphasize how it can provide tools that facilitate the modeling and understanding of complex systems (e.g., a chemical process that generates large amounts of data). It is also important to remember that, when chemical engineers analyze data, they are ultimately interested in understanding phenomena and in extracting underlying principles; in other words, engineers aim to attribute physical meaning/origin to observed behavior in a way that helps them make decisions (e.g., design a new material or microbe to perform a function). As such, when teaching statistics, it is important not to discard the physical and decision-making context. Finally, with the advent of machine learning and data science, it is important to remember that statistics (together with calculus and linear algebra) provides key mathematical fundamentals that are the building blocks of these tools.
The course that I designed at UW-Madison (and the accompanying textbook) follows a "data-models-decisions" pipeline. The intent of this design is to emphasize that statistics is a modeling paradigm that maps data to decisions; it also aims to "connect the dots" between different branches of statistics. The focus on the pipeline is also important in reminding students that the application context matters. For instance, the data and type of model used for process design can be quite different from the data and type of model used for experimental design. Similarly, the nature of the decision and the data available influence the type of model used. The design is also intended to help the student understand the close interplay between statistical and physical modeling; specifically, we emphasize how statistics provides tools to model aspects of a system that cannot be fully predicted from physics. The design also helps the student appreciate how statistics provides a foundation for a broad range of modern tools of data science and machine learning (e.g., neural networks and logistic regression). More broadly, the course/book emphasizes how statistics provides a way to think about the world. For instance, we discuss how statistical thinking is fundamentally different from deterministic thinking (which ignores uncertainty). Making this distinction is extremely important, as chemical engineering courses tend to be taught with a deterministic mindset (e.g., there is no uncertainty/variability in the data and the model used for analysis is perfect). In this context, the course/book discusses how ambiguity can arise when one faces uncertainty, as key variables of interest (e.g., cost and carbon emissions) are no longer single numbers (they are distributions) and thus cannot be compared so easily; a small sketch of this situation is given below. Moreover, we discuss how statistics provides tools that can help make decisions that mitigate/control uncertainty.
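As a minimal illustration of this ambiguity (Python, with hypothetical cost numbers; not drawn from the course/book), consider two process designs whose costs are distributions rather than single numbers.

# Minimal sketch (hypothetical numbers): comparing two process designs when
# cost is a distribution rather than a single number.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed cost models: design A is cheaper on average but more variable;
# design B is more expensive on average but more predictable.
cost_A = rng.normal(loc=100.0, scale=20.0, size=n)   # $/ton, hypothetical
cost_B = rng.normal(loc=110.0, scale=5.0,  size=n)   # $/ton, hypothetical

print(f"mean cost A = {cost_A.mean():.1f}, std = {cost_A.std():.1f}")
print(f"mean cost B = {cost_B.mean():.1f}, std = {cost_B.std():.1f}")

# Under uncertainty the comparison is ambiguous: A wins only part of the time.
print(f"P(A cheaper than B) = {np.mean(cost_A < cost_B):.2f}")

# A risk-aware metric (e.g., 95th-percentile cost) can flip the ranking.
print(f"95th percentile: A = {np.percentile(cost_A, 95):.1f}, "
      f"B = {np.percentile(cost_B, 95):.1f}")

Here, design A has the lower expected cost but the higher risk, and a risk-aware metric such as the 95th-percentile cost reverses the ranking; this is precisely the kind of ambiguity that deterministic thinking hides.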
The talk also offers insights into experiences in using software as a way to reduce complex mathematical concepts to practice. Moreover, I discuss how statistics provides an excellent framework to teach and reinforce concepts of linear algebra and optimization. For instance, it is much easier to convey the relevance of eigenvalues when they are presented from the perspective of data science (e.g., they measure the information content of data).
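As an example of this perspective (a minimal Python sketch with synthetic data; illustrative only, not taken from the course material), the eigenvalues of the data covariance matrix quantify how much variance, and hence how much information, each principal direction captures.

# Minimal sketch (synthetic data): eigenvalues of the covariance matrix
# measure how much variance ("information") each principal direction carries.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic process data: 3 measured variables driven by one latent factor,
# so most of the information lives along a single direction.
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[1.0, 0.8, -0.5]]) + 0.1 * rng.normal(size=(500, 3))

# Center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvalues = variances along the principal directions (sorted largest first).
eigvals = np.linalg.eigvalsh(cov)[::-1]
explained = eigvals / eigvals.sum()
print("fraction of variance captured by each direction:", np.round(explained, 3))
# The dominant eigenvalue reveals that one direction carries most of the information.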