(246c) Simultaneous Canonical Polyadic Decomposition As a Data Fusion Algorithm to Develop Pseudo-Chemistry from Spectral Data
2019 AIChE Annual Meeting
Topical Conference: Applications of Data Science to Molecules and Materials
Applications of Data Science in Catalysis and Reaction Engineering I
Tuesday, November 12, 2019 - 8:36am to 8:54am
Keywords: Data Mining, Tensor Decomposition, Data Fusion, Bayesian Networks, Reaction Pathway Generation

In-line spectral analyzers are widely used to obtain molecular-level information because they are fast, non-invasive, non-destructive and inexpensive, and do not require sample preparation. The process data from spectral analyzers are high-dimensional, multi-way, non-causal and rank-deficient, and have missing values. This makes it challenging to use such process data to develop causal models that hypothesize reaction pathways. This work uses spectral datasets from Fourier Transform Infrared (FTIR) and Proton Nuclear Magnetic Resonance (1H NMR) spectroscopy of samples from the vis-breaking of Cold Lake bitumen. The absorbance data over wavenumbers/chemical shifts are collected at different reaction temperatures and residence times of the samples in the vis-breaker. Each of the two spectral measurements, collected across three modes (temperature, residence time and wavenumber/chemical shift), forms a multi-way tensorial block of process data. The objective of this work is to develop a data fusion framework that preserves the tri-linear structure of the tensorial blocks of spectral measurements, accounts for missing spectral measurements and for the high dimensionality of the absorbance data across wavenumbers/chemical shifts, and incorporates complementary information about the sample from both techniques. This is accomplished using simultaneous canonical polyadic decomposition (CPD), in which the tensorial blocks of spectral data are jointly factorized into independent factors in each mode while capturing intermodal interactions during the decomposition, resulting in a unique factorization that is free from rotational and intensity ambiguities [1].
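As a hedged illustration of what such a coupled model can look like, one common formulation is written out below. It assumes (the abstract does not say which modes share factors) that the FTIR tensor X^(1) and the 1H NMR tensor X^(2) share the temperature factors a and the residence-time factors b, and that each keeps its own spectral factors c^(1) and c^(2). The symbols and index names are illustrative choices, not the authors' notation.

```latex
\mathcal{X}^{(1)}_{ijk} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c^{(1)}_{kr},
\qquad
\mathcal{X}^{(2)}_{ijl} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c^{(2)}_{lr},
\qquad
a_{ir},\; b_{jr},\; c^{(1)}_{kr},\; c^{(2)}_{lr} \ge 0
```

Here i indexes temperature, j residence time, k wavenumber, l chemical shift, and R is the number of factors (pseudo-components) common to both blocks.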

The process data jointly mined by the data fusion algorithm are then used to develop inferential models for monitoring the complex vis-breaking process, by developing pseudo-reaction networks that hypothesize chemical pathways [2]. Analytical characterization of the reacting mixture is so difficult that even enumerating all the major species taking part in the reactions is a significant challenge for conventional methods. The pseudo-reaction networks are therefore developed using a probabilistic framework in which the independent factors of a mode obtained from the simultaneous CPD are viewed as random variables with a multinomial distribution (the parameters of which follow a Dirichlet distribution). Bayesian networks, which are probabilistic graphical models that encode directed acyclic causal structures among the nodes representing these random variables, are used to develop inferential hypotheses about chemical pathways from the factors obtained by data fusion of the spectral measurements. A directed edge is placed between two nodes if it maximizes the log likelihood, which is a function of the mutual information and entropy calculated from the probability distributions of the random variables designated as nodes (the factors obtained from CPD). This amounts to using a score called the Bayesian information criterion (BIC), which is the log likelihood of the entire network (pairwise directed edges between nodes) penalized by the complexity of the network (number of edges between nodes). The Bayesian networks, i.e., the directed acyclic graphs (DAGs) encoding causal relationships among the factors obtained from simultaneous CPD, are obtained with heuristic, score-based greedy search methods that make locally optimal choices, checking whether each candidate directed edge between a pair of nodes increases the penalized log likelihood. During data fusion using the simultaneous CPD, the independent factors in each mode are constrained to be non-negative so that the decomposition is physically meaningful and complies with the Beer-Lambert law for spectral data, which states that absorbance is directly proportional to the concentration of the components. Hence each independent factor from the data fusion algorithm can be physically interpreted as representing a class of chemical compounds (a pseudo-component). The trilinear tensor decomposition yields the concentrations of these pseudo-components in the temperature and residence-time modes, while the third mode of wavenumbers/chemical shifts contains the spectral signatures of the corresponding pseudo-components.
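The BIC-scored greedy search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the factor values are assumed to have been discretized into a small number of levels (the toy `levels` data are hypothetical), the penalty is the standard BIC term based on the number of free multinomial parameters (which grows with the number of edges), and only edge additions are considered.

```python
import itertools
import math
from collections import Counter


def family_bic(data, child, parents, n_states):
    """BIC term for one node: conditional log likelihood minus a size penalty."""
    n = len(data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    joint_counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint_counts.items())
    n_params = (n_states - 1) * n_states ** len(parents)  # free multinomial parameters
    return loglik - 0.5 * math.log(n) * n_params


def creates_cycle(parents, src, dst):
    """True if adding src -> dst closes a cycle, i.e. dst is already an ancestor of src."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def greedy_bic_search(data, n_nodes, n_states):
    """Greedy hill climbing: repeatedly add the single edge that most improves BIC."""
    parents = {v: [] for v in range(n_nodes)}
    score = {v: family_bic(data, v, [], n_states) for v in range(n_nodes)}
    while True:
        best_gain, best_edge = 1e-9, None
        for src, dst in itertools.permutations(range(n_nodes), 2):
            if src in parents[dst] or creates_cycle(parents, src, dst):
                continue
            gain = family_bic(data, dst, parents[dst] + [src], n_states) - score[dst]
            if gain > best_gain:
                best_gain, best_edge = gain, (src, dst)
        if best_edge is None:  # no edge improves the penalized log likelihood
            return parents
        src, dst = best_edge
        parents[dst].append(src)
        score[dst] += best_gain


# Hypothetical toy data: node 1 copies node 0, node 2 is independent of both.
levels = [(i % 2, i % 2, (i // 2) % 2) for i in range(40)]
network = greedy_bic_search(levels, n_nodes=3, n_states=2)
```

The result maps each node to its list of parents. Because BIC cannot distinguish 0 -> 1 from 1 -> 0 in this toy case, the direction returned depends on search order, which is one reason such networks are better read as hypotheses than as established causal claims.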

The number of independent factors in the CPD is determined using the core consistency diagnostic [3]: PARAFAC models are fitted with different candidate numbers of factors and then cast into Tucker models, whose core should be an identity hypercube (ones on the superdiagonal, zeros elsewhere) if the right number of components is used. This technique exploits the fact that tensor decompositions are higher-order extensions of the matrix singular value decomposition (SVD) and broadly fall into two categories: parallel factor decomposition (PARAFAC), which represents a tensor as a sum of rank-1 tensors, and Tucker decomposition, which is a higher-order SVD. The main difference between PARAFAC and Tucker is that the number of factors is the same in every mode in the former, making it a restricted version of the latter. When this diagnostic was applied to the tensor blocks of FTIR and 1H NMR data in this work, the number of factors obtained was four. The spectral signatures of the four pseudo-components obtained from the data fusion algorithm (Fig. 1) were then used to develop Bayesian networks (Fig. 2).
Figure 1. Spectral signatures of pseudo-components obtained from the data fusion algorithm
Figure 2. Bayesian networks developed from the spectral signatures of the pseudo-components (PC)
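The core consistency diagnostic used to choose the number of factors can be sketched for a single three-way tensor as follows. This is a minimal NumPy illustration, assuming the CP factor matrices have full column rank and carry all scaling (so the ideal core has ones on its superdiagonal). The function names and the synthetic data are hypothetical, and the fitting of the PARAFAC model itself is not shown.

```python
import numpy as np


def core_consistency(X, A, B, C):
    """Core consistency (in percent) of a CP model with factors A, B, C for tensor X."""
    R = A.shape[1]
    # Least-squares Tucker core for the fixed CP loadings (pseudo-inverse in each mode).
    G = np.einsum('pi,qj,rk,ijk->pqr',
                  np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C), X)
    # Ideal core: ones on the superdiagonal, zeros elsewhere.
    T = np.zeros((R, R, R))
    T[np.arange(R), np.arange(R), np.arange(R)] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / R)


rng = np.random.default_rng(0)
A, B, C = rng.random((5, 2)), rng.random((4, 2)), rng.random((6, 2))

# Exactly trilinear data: the core is superdiagonal, so the diagnostic is 100.
X_trilinear = np.einsum('ir,jr,kr->ijk', A, B, C)
cc_trilinear = core_consistency(X_trilinear, A, B, C)

# Add one cross-factor interaction term (a Tucker-type off-diagonal element).
X_interacting = X_trilinear + np.einsum('i,j,k->ijk', A[:, 0], B[:, 1], C[:, 0])
cc_interacting = core_consistency(X_interacting, A, B, C)
```

In practice the diagnostic is evaluated for models fitted with increasing numbers of factors, and the largest number that still gives a value close to 100 is retained.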

Analysis of the spectral signatures reveals the following: pseudo-component 1 (PC1) mainly consists of carbonyl groups and cycloalkanes; pseudo-component 2 (PC2) consists of polyaromatics, alkoxy groups, phenols and alkenes; pseudo-component 3 (PC3) consists of aromatics, alkanes and condensed products; and pseudo-component 4 (PC4) consists of phenols, acyls and condensed aromatics. It can therefore be hypothesized from the Bayesian network structure in Fig. 2 that the underlying chemical reaction pathways during the vis-breaking of bitumen tend toward more saturated end products through the free-radical mechanism of hydrogen radical addition. However, the longer-chain aliphatics crack to give more condensed polyaromatic products, which are undesirable, even as the end products of thermal cracking have a more aliphatic nature (alkanes and olefins).

The novelty of this work lies in implementing a constrained simultaneous CPD of multiple tensors using an optimization approach [4], which solves for the decision variables (the factors) using the gradients of an objective function formulated as the reconstruction error of the tensors from their multi-modal factors. This is an improvement over the Canonical Polyadic Alternating Least Squares (CPALS) approach, which is typically used for the unconstrained decomposition of a single tensor block and is not guaranteed to converge to a stationary point, which limits its accuracy. In addition, the simultaneous CPD developed in this work is designed to handle missing data by accounting for them through a weighting matrix in the objective function of the optimization framework. The causal DAG among the factors obtained from jointly decomposing both tensors (FTIR and 1H NMR measurements) is developed using Bayesian networks and is representative of the underlying chemical pathways among the pseudo-components. By jointly mining spectral measurements in a constrained data fusion framework, this work makes the factors physically interpretable, providing a first pass at building causal inferential models that generate reaction network hypotheses from process data (spectral measurements).
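A gradient-based, weighted and non-negative coupled decomposition of the kind described above can be sketched as follows. This is an illustrative sketch under several assumptions that are not taken from the abstract: the two tensors share their first two factor matrices, the weights are binary masks (1 = observed, 0 = missing), and projected gradient descent with step halving stands in for whatever optimizer the authors used. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def reconstruct(A, B, C):
    """Rebuild a three-way tensor from its CP factor matrices."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)


def coupled_loss_and_grads(X1, W1, X2, W2, A, B, C1, C2):
    """Weighted reconstruction error of two tensors sharing A and B, with gradients."""
    res1 = W1 * (reconstruct(A, B, C1) - X1)  # residuals; missing entries get weight 0
    res2 = W2 * (reconstruct(A, B, C2) - X2)
    loss = 0.5 * (np.sum(res1 ** 2) + np.sum(res2 ** 2))
    G1, G2 = W1 * res1, W2 * res2  # derivative of the loss w.r.t. each reconstruction
    gA = np.einsum('ijk,jr,kr->ir', G1, B, C1) + np.einsum('ijk,jr,kr->ir', G2, B, C2)
    gB = np.einsum('ijk,ir,kr->jr', G1, A, C1) + np.einsum('ijk,ir,kr->jr', G2, A, C2)
    gC1 = np.einsum('ijk,ir,jr->kr', G1, A, B)
    gC2 = np.einsum('ijk,ir,jr->kr', G2, A, B)
    return loss, [gA, gB, gC1, gC2]


def fit(X1, W1, X2, W2, rank, n_iter=200, seed=0):
    """Projected gradient descent; clipping at zero keeps every factor non-negative."""
    rng = np.random.default_rng(seed)
    sizes = (X1.shape[0], X1.shape[1], X1.shape[2], X2.shape[2])
    factors = [rng.random((n, rank)) for n in sizes]
    loss, grads = coupled_loss_and_grads(X1, W1, X2, W2, *factors)
    history, step = [loss], 1.0
    for _ in range(n_iter):
        accepted = False
        while step > 1e-12:  # halve the step until the loss decreases
            trial = [np.maximum(F - step * g, 0.0) for F, g in zip(factors, grads)]
            new_loss, new_grads = coupled_loss_and_grads(X1, W1, X2, W2, *trial)
            if new_loss < loss:
                accepted = True
                break
            step *= 0.5
        if not accepted:
            break
        factors, loss, grads = trial, new_loss, new_grads
        history.append(loss)
        step *= 2.0
    return factors, history


# Synthetic example: 4 temperatures, 5 residence times, 30 and 20 spectral channels.
rng = np.random.default_rng(42)
A0, B0 = rng.random((4, 2)), rng.random((5, 2))
C10, C20 = rng.random((30, 2)), rng.random((20, 2))
W1 = (rng.random((4, 5, 30)) > 0.1).astype(float)  # roughly 10% of entries missing
W2 = np.ones((4, 5, 20))
X1 = np.where(W1 > 0, reconstruct(A0, B0, C10), 0.0)  # placeholder 0 where missing
X2 = reconstruct(A0, B0, C20)
factors, history = fit(X1, W1, X2, W2, rank=2)
```

Missing entries must hold a finite placeholder (zero here) rather than NaN, because a NaN multiplied by a zero weight is still NaN. The entries of `history` decrease monotonically by construction, but, as with any non-convex CP fit, the result depends on the starting point.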
References
[1] T. G. Kolda and B. W. Bader, "Tensor Decompositions and Applications," SIAM Rev., vol. 51, no. 3, pp. 455-500, 2009.
[2] D. T. Tefera, L. M. Yañez Jaramillo, R. Ranjan, C. Li, A. De Klerk, and V. Prasad, "A Bayesian learning approach to modeling pseudoreaction networks for complex reacting systems: Application to the mild visbreaking of bitumen," Ind. Eng. Chem. Res., vol. 56, no. 8, pp. 1961-1970, 2017.
[3] R. Bro and H. A. Kiers, "A new efficient method for determining the number of components in PARAFAC models," J. Chemometrics, vol. 17, pp. 274-286, 2003.
[4] E. Acar, D. M. Dunlavy, and T. G. Kolda, "A scalable optimization approach for fitting canonical tensor decompositions," J. Chemometrics, vol. 25, no. 2, pp. 67-86, 2011.