(266g) An Information Entropy Based Consistency Index for Evaluating the Performance of Variable Selection Methods | AIChE


Authors 

He, Q. - Presenter, Tuskegee University

An information entropy based consistency index for evaluating
the performance of variable selection methods

Q. Peter He

Department of Chemical Engineering,
Auburn University, Auburn, AL 36849, USA

Data-driven soft sensors have been widely used in both
academic research and industrial applications for predicting hard-to-measure
variables or replacing physical sensors to reduce cost. It has been shown that
the performance of these data-driven soft sensors could be greatly improved by
selecting only the vital variables that strongly affect the primary variables,
rather than using all the available process variables.

In the past few decades, many different variable selection
approaches have been reported for various applications with different soft
sensor modeling methods. In order to evaluate the performance of different
variable selection methods, several performance indices have been proposed in
the literature. The most common ones are the average mean absolute percentage
error (MAPE), the coefficient of determination (R²), and the geometric mean of
selection sensitivity and specificity (G). Among them, only G directly measures
the accuracy of variable selection results, while MAPE and R² indirectly
measure the effects of variable selection through the prediction performance of
a soft sensor model, such as partial least squares (PLS). However, when the true
relevant variables are not known, which is the case for most industrial applications,
selection sensitivity and specificity (and therefore G) cannot be obtained, and
no direct metric of variable selection performance is available in the literature.
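
For readers less familiar with the three indices named above, the following is a minimal sketch of their textbook definitions. The function names, the use of integer variable indices, and the toy numbers in the usage note are illustrative assumptions, not taken from the cited work.

```python
import math

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (assumes no zero targets)."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def g_mean(selected, relevant, n_variables):
    """Geometric mean of selection sensitivity and specificity.

    Requires the ground truth `relevant`, which is why G is not
    available for most industrial data sets.
    """
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(n_variables)) - relevant
    # Sensitivity: fraction of truly relevant variables that were selected.
    sensitivity = len(selected & relevant) / len(relevant)
    # Specificity: fraction of irrelevant variables that were left out.
    specificity = len(irrelevant - selected) / len(irrelevant)
    return math.sqrt(sensitivity * specificity)
```

As a usage example, with six candidate variables of which 0, 1, and 2 are truly relevant, calling g_mean([0, 1, 4], [0, 1, 2], 6) gives a sensitivity of 2/3 and a specificity of 2/3, so G is 2/3.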

We recently reported an entropy-based variable selection
index to assess variable selection performance, which does not require the
ground truth of variable relevance [1]. The index evaluates the
consistency of the variable selection results and is termed the consistency
index. It was shown that the consistency index describes the variable selection
performance well for both simulated (with ground truth) and industrial (without
ground truth) case studies. However, the consistency index does not fully
agree with a common expectation: that consistency is lowest when a variable is
selected for a model 50% of the time. Instead, the minimum (the lowest
consistency) occurs at a selection probability of 0.3679 (see Figure 1). In addition, the
consistency index cannot make use of the ground truth even when it is
available.
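
The value 0.3679 coincides with 1/e, the point at which the single entropy term -p ln(p) reaches its maximum. The abstract does not state the formula used in [1], so the sketch below is only an assumption about where the number could come from: it locates the peak of -p ln(p) numerically on a grid. The function name and grid resolution are illustrative.

```python
import math

def entropy_term(p):
    """Single entropy term -p*ln(p), using the convention 0*ln(0) = 0."""
    return 0.0 if p == 0.0 else -p * math.log(p)

# Search a uniform grid on [0, 1] for the probability with the largest term.
grid = [i / 10000 for i in range(10001)]
p_peak = max(grid, key=entropy_term)

# 1/e is approximately 0.3679, so the peak is not at p = 0.5.
print(p_peak, 1.0 / math.e)
```

Under this assumption, an index built from this term alone would rate a variable selected about 37% of the time as less consistent than one selected 50% of the time.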

To address these limitations, in this work we propose a
modified consistency index based on information entropy. It has a symmetric
response curve, as shown in Figure 2, and can be applied to all cases, whether
or not information on the true relevant variables is available. Simulated
and industrial case studies are provided to compare the performance of the
proposed index with that of the existing indices.
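
The abstract does not give the formula of the proposed index. Purely as an illustration of what a symmetric entropy response looks like, the sketch below uses the binary Shannon entropy of a selection probability p, which is symmetric about p = 0.5 and largest there. The mapping to a consistency value (one minus entropy) and both function names are hypothetical and should not be read as the author's definition.

```python
import math

def binary_entropy(p):
    """Binary Shannon entropy in bits; equals 0 at p = 0 or 1 and 1 at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def consistency_like(p):
    """Hypothetical consistency score: 1 when a variable is always or never
    selected, 0 when it is selected half of the time."""
    return 1.0 - binary_entropy(p)

# The response is symmetric: p and 1 - p receive the same score.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(consistency_like(p), 4))
```

With this kind of function, a variable selected 20% of the time and one selected 80% of the time are rated as equally consistent, which matches the common expectation described above.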

Figure 1. Probability vs. consistency based on [1]

Figure 2. Probability vs. consistency based on the proposed
index

References:

[1] Z. X. Wang, Q. P. He, and J. Wang, “Comparison of variable selection methods for PLS-based soft sensor modeling,” J. Process Control, vol. 26, pp. 56–72, 2015.