(549a) Data Dexterity at Rensselaer

Authors 

Hahn, J. - Presenter, Rensselaer Polytechnic Institute
Wang, X., Rensselaer Polytechnic Institute
Kruger, U., Rensselaer Polytechnic Institute
Given the increasing importance of data science in all aspects of engineering, Rensselaer Polytechnic Institute developed a Data Dexterity requirement for all undergraduate students to ensure a minimum level of competency in working with data. The requirement consists of two data-centric courses: an introductory course that is common to all students in a college and a more advanced, department-specific course. A task force created by the School of Engineering (SoE) collected feedback from a number of constituents on which data-related topics are most important for engineering students. This presentation focuses on the recommendations of the task force, the implementation of these recommendations by several departments, and some general concepts related to teaching the fundamentals of data science that emerged from these recommendations.

The charge of the SoE task force was to identify the key (data/analytics) fundamentals, methods, techniques, and algorithms that should be taught to School of Engineering undergraduate students, focusing primarily on the first data-intensive course that affects all engineering students. While the National Academies recently published a related report [1], that report was framed mainly from a computer science perspective and was therefore of limited use for deriving recommendations for engineers. The task force identified three major areas, each with several subcategories:

  • Analytic foundations, which refer to the fundamental science that enables data to be described theoretically and analyzed empirically:

(a) Mathematical foundations (e.g., set theory, probability, optimization)

(b) Computational foundations (e.g., algorithms, data structures, simulation)

(c) Statistical foundations (e.g., uncertainty, error, modeling, experiments)

  • Data representation and communication, which refers to the way data are managed, modeled, and integrated into workflows:

(a) Data management and curation (e.g., data preparation, management, privacy, cleaning, database design)

(b) Data description and visualization (e.g., consistency, exploratory data analysis, visualization, dashboards)

(c) Data modeling and assessment (e.g., machine learning, sensitivity analysis, interpretation)

(d) Workflow and reproducibility (e.g., provenance, documentation, version control, collaboration)

(e) Communication and teamwork (e.g., needs analysis, reporting, presentation)

  • Ethical use of data, which includes privacy and confidentiality of data as well as the recognition that an unbalanced or biased data set leads to skewed and/or biased analysis and outcomes (i.e., the data equivalent of garbage-in-garbage-out); a brief sketch of this effect is shown below.
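
As an illustration of the garbage-in-garbage-out point above, the short Python sketch below (a hypothetical example, not course material) shows how a heavily imbalanced data set allows a model that ignores the minority class entirely to still report a seemingly excellent accuracy; the 95/5 class split and sample size are illustrative assumptions.

    import numpy as np

    # Hypothetical, heavily imbalanced labels: ~95% of samples belong to class 0.
    rng = np.random.default_rng(0)
    y_true = rng.choice([0, 1], size=1000, p=[0.95, 0.05])

    # A "model" that simply predicts the majority class for every sample.
    y_pred = np.zeros_like(y_true)

    accuracy = np.mean(y_pred == y_true)
    minority_recall = np.mean(y_pred[y_true == 1] == 1)

    print(f"accuracy:        {accuracy:.2f}")         # ~0.95, looks excellent
    print(f"minority recall: {minority_recall:.2f}")  # 0.00, class 1 is never detected

The misleadingly high accuracy is exactly the kind of skewed outcome referred to above: the analysis is only as representative as the data that feed it.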

One outcome of the task force recommendations was the revision of the course “Modeling and Analysis of Uncertainty” (MAU), required of all engineering students in their second year, to provide an introduction and broad foundation for working with data. This course focuses on the probabilistic and statistical concepts required for data analysis and includes a project in which students apply a variety of these concepts to a real-world data problem provided to them. Courses offered by individual departments then build upon the foundations laid in MAU. For example, the Biomedical Engineering (BME) department offers a mandatory class on Modeling of Biomedical Systems that teaches modeling, numerical methods, parameter estimation, and an introduction to feedback control, in addition to integrating data analysis into the key laboratory course. Furthermore, the BME department offers an optional specialization in Biomedical Data Science that requires students interested in the Data Science Certificate to take courses on Biostatistics for Life Science Applications as well as Biomedical Data Science. In these classes, students are exposed to univariate and multivariate approaches for dealing with data, including an introduction to data mining, machine learning, and deep learning. To obtain the certificate in Biomedical Data Science, students must take these two courses plus two additional data-specific electives from a list of seven courses, e.g., Design of Experiments, Electronic Instrumentation, or Modeling and Control of Dynamic Systems.

Aside from the general curriculum, the discussion of implementing Data Dexterity at Rensselaer also included a number of other topics. For example, while it was generally acknowledged that “programming is not data science”, programming is fundamental to performing any meaningful data analysis: large data sets are usually stored in databases, and extracting information from databases and transforming the data into a form suitable for analysis requires some programming skill. Furthermore, even when one does not deal with large data sets, any analysis that goes beyond simple, software-supplied functions requires a minimum level of programming. As a result, students in most engineering departments at Rensselaer now have to take a three-credit-hour programming class as part of the curriculum.
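
To make the preceding point concrete, the Python sketch below illustrates the kind of minimal programming involved in pulling records out of a database and reshaping them for analysis; the database file, table, and column names are hypothetical and not drawn from any Rensselaer course.

    import sqlite3
    import pandas as pd

    # Hypothetical database and table; in practice the data might live in a
    # plant historian, LIMS, or other institutional database.
    conn = sqlite3.connect("experiments.db")

    # Extract only the columns of interest into a data frame.
    df = pd.read_sql_query(
        "SELECT run_id, temperature, pressure, yield_pct FROM runs", conn
    )
    conn.close()

    # Typical preparation before any analysis: drop incomplete records and
    # summarize the measurements by experimental run.
    df = df.dropna()
    summary = df.groupby("run_id")[["temperature", "pressure", "yield_pct"]].mean()
    print(summary.head())

Even this small amount of glue code goes beyond spreadsheet-style, software-supplied functions, which is the motivation for the programming requirement described above.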

Another point of discussion was the need for “hardware/software infrastructure” so that students are able to engage in data science. Since Rensselaer is a private school where all students are required to have their own laptop, and a number of software packages (MATLAB and toolboxes, Mathematica, Maple, etc.) are provided to them for free, this point was of lesser concern here, but it may be a larger part of the discussion elsewhere. It should also be noted that open-source tools well suited to data science, such as R, Python, and Jupyter notebooks, can be downloaded for free.

The last major point of discussion was related to soft skills surrounding data. For many businesses, use of their data represents a major portion of their business model; as such, confidentiality when dealing with some data sets is essential. Likewise, if data involve private or health-related information, there are significant federal mandates related to data confidentiality. Finally, any conclusions drawn from data can only be as good as the quality of the data collection, i.e., incomplete, non-independent, or biased data collection will result in incomplete, non-representative, or biased interpretation of the analyzed data.

References:

[1] National Academies of Sciences, Engineering, and Medicine. 2018. Data Science for Undergraduates: Opportunities and Options. Washington, DC: The National Academies Press. https://doi.org/10.17226/25104.
