(121c) Accelerating Product Development with Diverse Training Data | AIChE

Authors 

Kroenlein, K. - Presenter, National Institute of Standards and Technology
Bernasek, S. M., Citrine Informatics
Kubie, L., Citrine Informatics

When designing a product, experts meld a variety of data, including historical experiments, manufacturability limitations, and fundamental physical and chemical understanding. These resources are so broad that it is difficult to stay aware of all information relevant to a given design, and the growth of data volumes following substantial digitalization efforts has exacerbated the challenge. Combining these disparate data streams is labor intensive: differences in schema, assumptions about format, and variations in taxonomy often make merging without human intervention impracticable, even before considering lab notebooks or other non-digital assets. These data are heterogeneous in structure, sparsely populated, and often statistically small.

Citrine Informatics takes a divide-and-conquer approach to storing these connected data and then leveraging them for AI. First, we use the Graphical Expression of Materials Data (GEMD [1]) model to give data sources structure and detailed process-history information using partner-defined terminology. This allows comparison and synthesis across data sources without forcing complex records into a rigid schema. Second, we use graph queries defined in our citrine-python library [2] to normalize data into a tabular format with consistent units. The queries are expressed using the same organization-defined terms from GEMD. Third, because the heterogeneity of the data means that not every value is defined for every row, we use networks of models to fill empty cells through transfer learning [3]. Finally, in model validation we use leave-one-cluster-out cross-validation [4] to develop reasonable uncertainty expectations for the system of models in light of the population imbalance common to industrial data. Combining these methods into a unified data and modeling stack has resulted in a tool that can engage with data sources where they are today, allow for forward compatibility as data continue to accumulate and evolve, and permit reuse and retraining of historical models with minimal human intervention.
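
To make the validation step concrete, the following is a minimal sketch of leave-one-cluster-out cross-validation in plain Python. It is not Citrine's implementation: the k-means clustering, the nearest-neighbor regressor, the synthetic three-blob dataset, and every function name here (kmeans, knn_predict, loco_cv, make_data) are assumptions chosen only for illustration. The idea it shows is that each fold holds out an entire cluster of similar points, so the reported error reflects extrapolation to an unseen region rather than interpolation between near-duplicates.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def knn_predict(train_x, train_y, x, k=3):
    """Mean target of the k nearest training points (a stand-in model)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda t: math.dist(t[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def loco_cv(points, targets, labels):
    """Hold out one whole cluster at a time; return RMSE per held-out cluster."""
    if len(set(labels)) < 2:
        raise ValueError("need at least two clusters to hold one out")
    rmse = {}
    for held_out in sorted(set(labels)):
        rows = list(zip(points, targets, labels))
        train = [(p, y) for p, y, lab in rows if lab != held_out]
        test = [(p, y) for p, y, lab in rows if lab == held_out]
        train_x, train_y = zip(*train)
        sq = [(knn_predict(train_x, train_y, p) - y) ** 2 for p, y in test]
        rmse[held_out] = math.sqrt(sum(sq) / len(sq))
    return rmse


def make_data(seed=0):
    """Synthetic, imbalanced-looking data: three tight blobs in 2D."""
    rng = random.Random(seed)
    points, targets = [], []
    for cx, cy in [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]:
        for _ in range(20):
            p = (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
            points.append(p)
            targets.append(2.0 * p[0] + p[1])  # hypothetical property
    return points, targets


points, targets = make_data()
labels = kmeans(points, k=3)
for cluster, err in loco_cv(points, targets, labels).items():
    print(f"held-out cluster {cluster}: RMSE = {err:.2f}")
```

In practice the fold splitting could be delegated to an existing utility such as scikit-learn's LeaveOneGroupOut, with the cluster labels passed as groups; the hand-rolled version above is kept dependency-free so the mechanics stay visible.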

[1] https://citrineinformatics.github.io/gemd-docs/

[2] https://citrineinformatics.github.io/citrine-python/

[3] M Hutchinson, E Antono, B Gibbons, S Paradiso, J Ling, and B Meredig. Overcoming data scarcity with transfer learning. arXiv preprint arXiv:1711.05099, 2017.

[4] B Meredig, E Antono, C Church, M Hutchinson, J Ling, S Paradiso, B Blaiszik et al. Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery. Molecular Systems Design & Engineering, 2018.