(599v) Fast Kernel Density Estimation Using Orthogonal Polynomials and Goertzel DFT

Authors 

Ungarala, S. - Presenter, Cleveland State University


Process monitoring methods that identify abnormal conditions in manufacturing operations are generally based on statistical principles. Statistical methods such as principal component analysis (PCA) and partial least squares (PLS) have gained widespread acceptance within the process industries (Qin, 2003). The fundamental statistical measure describing the behavior of process data is the probability density function (pdf) of the data. Estimating the pdf from data is relevant to routine data analysis problems in many disciplines of engineering, statistics and economics. Methods of statistical process control and process monitoring rely on knowledge of the data pdf, which is non-Gaussian in many real-world situations.

The central problem in statistical process monitoring may be posed as follows: given a set of N random numbers, estimate the underlying pdf, evaluated on a chosen set of N support points. A non-parametric estimate of the pdf is given by the kernel density estimator (KDE). The KDE is implemented as a normalized sum of kernel functions of a chosen width, known as the bandwidth, each centered on a process data point. The standard Gaussian function is a commonly employed kernel, although the exact form of the kernel matters less to the accuracy of the estimated pdf than the choice of the bandwidth. For example, Safavi et al. (1997) used wavelets as kernel functions and Johnston and Kramer (1994) used elliptical basis functions. Since these two publications, remarkably little attention has been given to improving the KDE within the process systems engineering community, despite the importance of the pdf estimate for applications in routine process operations. Recently, however, there has been some interest in using the KDE for state estimation with the moving horizon estimator (MHE). The objective function of the MHE includes an arrival cost term summarizing the uncertainty at the beginning of each horizon, traditionally derived from a Gaussian pdf assumption. The use of the KDE for computing the arrival cost from a non-Gaussian pdf estimate was suggested by Ungarala (2009) and Lopez-Negrete et al. (2011).
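
For concreteness, a direct implementation of the Gaussian KDE may be sketched in Python as follows; the bandwidth value and the support grid are illustrative choices, not part of the proposed method.

import numpy as np

def gaussian_kde(data, grid, h):
    # Scaled distances between every support point and every data point
    u = (grid[:, None] - data[None, :]) / h
    # Normalized sum of Gaussian kernels of bandwidth h centered on the data
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

# Example: N = 1000 samples evaluated on N support points
data = np.random.randn(1000)
grid = np.linspace(data.min(), data.max(), 1000)
pdf_hat = gaussian_kde(data, grid, h=0.3)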

The KDE procedure is a convolution of the data with the kernel, assumed here to be the Gaussian function, and requires O(N^2) computation. Silverman (1982) showed that the fast Fourier transform (FFT) can be used to implement the KDE in O(N log N) operations rather than the O(N^2) of the canonical discrete Fourier transform (DFT), and the FFT-based KDE has since become the standard method. While it is computationally efficient, accurate estimation of the pdf by the KDE depends critically on the bandwidth. A small bandwidth results in small bias but large variance, and the estimate may suffer from high-frequency flutter. A large bandwidth yields small variance while increasing the bias, and the estimate may smooth out local detail because low-frequency components dominate (Silverman, 1986). Selection of the bandwidth according to various optimality criteria has been proposed (Jones et al., 1996). One of the many techniques aims to minimize the asymptotic integrated mean squared error (AIMSE) between the estimate and the true pdf. However, the minimizing bandwidth depends on a functional of the unknown pdf's second derivative. A KDE procedure for estimating the second derivative requires an optimal choice of its own bandwidth. In essence, the optimal bandwidth for a KDE estimate requires an estimate of the second derivative of the unknown density, the optimal bandwidth for the kernel estimate of the second derivative requires an estimate of the fourth derivative, and so on (Hardle et al., 1990).
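
To make the circular dependence concrete, the AIMSE-minimizing bandwidth for a Gaussian kernel is h = [1/(2 sqrt(pi) R N)]^(1/5), where R is the integral of the squared second derivative of the unknown pdf; Silverman's rule of thumb replaces R by its value for a Gaussian reference density, giving h = (4/3)^(1/5) sigma N^(-1/5), about 1.06 sigma N^(-1/5), where sigma is the sample standard deviation. A minimal sketch of both formulas (the function names are illustrative):

import numpy as np

def silverman_pilot_bandwidth(data):
    # Rule-of-thumb bandwidth, assuming the unknown pdf is Gaussian:
    # h = (4/3)^(1/5) * sigma * N^(-1/5), roughly 1.06 * sigma * N^(-1/5)
    return (4.0 / 3.0) ** 0.2 * np.std(data) * len(data) ** (-0.2)

def amise_bandwidth(R_f2, N):
    # AIMSE-minimizing bandwidth for a Gaussian kernel, given
    # R_f2 = integral of the squared second derivative of the pdf
    return (1.0 / (2.0 * np.sqrt(np.pi) * R_f2 * N)) ** 0.2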

In this paper, a fast iterative method is proposed for selecting the AIMSE-minimizing bandwidth by estimating the pdf's second derivative, and the underlying Fourier transform is implemented with an algorithm that is faster than the FFT under certain conditions. The FFT has several variants that have gained a formidable number of applications. However, a judicious choice among other incarnations of the DFT can prove more appropriate in particular situations. The FFT's mandated computation of Fourier coefficients at all allowable frequencies is by no means the most efficient pathway if a problem requires coefficients at only a few frequencies. In many pdf estimation problems, one often encounters pdfs that can be characterized by just a few low-frequency harmonics, for example skewed Gaussian pdfs and bimodal or trimodal pdfs. In this context, a computationally efficient alternative to the FFT is the Goertzel algorithm for the DFT (Goertzel, 1958). The algorithm computes a single-point DFT at a selected frequency using a second-order recursive infinite impulse response (IIR) filter. The recursion is performed on real numbers, and the output is obtained with a single complex multiplication. For M nonzero frequencies the algorithm requires O(MN) computation, but it needs only about one quarter of the real multiplications and about one half of the real additions required by the canonical DFT, which performs O(MN) complex multiplications. For M < log N, the Goertzel algorithm is more efficient than the FFT.
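
A minimal Python sketch of the Goertzel recursion for a single DFT coefficient is given below; the function name is illustrative, and for an N-point sequence x the result matches numpy.fft.fft(x)[k].

import numpy as np

def goertzel(x, k):
    # Single-bin DFT X[k] of the sequence x via a second-order IIR recursion.
    # The loop uses only real arithmetic when x is real; one complex
    # multiplication is needed at the end to form the coefficient.
    N = len(x)
    w = 2.0 * np.pi * k / N
    c = 2.0 * np.cos(w)                # real filter coefficient
    s_prev, s_prev2 = 0.0, 0.0
    for xn in x:
        s = xn + c * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return np.exp(1j * w) * s_prev - s_prev2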

The proposed fast algorithm for the optimal bandwidth is based on orthogonal projections of the pdf onto families of orthogonal polynomials such as the Gegenbauer polynomials. There are inherent computational conveniences in employing a polynomial approximation for the pdf estimate. For a given number of terms in the sums, evaluation of the polynomial requires only real multiplications and additions, compared to the complex multiplications of trigonometric polynomials or the Fourier series. Differentiation and integration of polynomials are also fast and straightforward. In the proposed fast algorithm, the pdf is first represented by a truncated trigonometric polynomial, or partial Fourier sum, obtained by computing only the first few low-frequency Fourier coefficients, using the Goertzel algorithm for each coefficient or the FFT if necessary. The smoothing factors of the Fourier coefficients are initialized with a pilot bandwidth chosen according to Silverman's rule of thumb, under the assumption that the unknown pdf belongs to the Gaussian family. The resulting sum of the low-frequency components may or may not yield an adequate representation of the density. As a post-processing step, the density is then assumed to be a partial Gegenbauer polynomial sum, whose coefficients are recovered as linear combinations of the previously computed Fourier coefficients. The relationship between the polynomial coefficients and the Fourier coefficients is a standard result exploited by researchers working on the resolution of the Gibbs phenomenon (Gottlieb and Shu, 1997). The second derivative of the density polynomial is then used to estimate the density functional in the minimum-AIMSE bandwidth formula, yielding an updated value of the bandwidth. The whole procedure is iterated by recomputing the smoothing factors; a simplified sketch of the iteration is given below. As the iterations continue, the assumption of an underlying Gaussian density is progressively weakened.
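
The following is a deliberately simplified sketch of the iteration just described, reusing the illustrative helpers above (goertzel, silverman_pilot_bandwidth, amise_bandwidth). To stay short it bins the data onto a grid, estimates R(f'') directly from the smoothed truncated Fourier sum, and omits the Gegenbauer post-processing step, so it conveys only the structure of the loop, not the full proposed algorithm.

import numpy as np

def iterate_amise_bandwidth(data, M=8, n_grid=512, n_iter=5):
    # Iteratively refine the AIMSE bandwidth from M low-frequency harmonics.
    N = len(data)
    lo, hi = data.min(), data.max()
    scale = 2.0 * np.pi / (hi - lo)        # maps the data range onto [0, 2*pi)
    # Bin the data so the DFT applies to an equispaced sequence
    counts, _ = np.histogram(data, bins=n_grid, range=(lo, hi))
    # First M+1 empirical Fourier coefficients, one Goertzel pass per harmonic
    a = np.array([goertzel(counts, k) for k in range(M + 1)]) / N
    k = np.arange(M + 1)
    h = silverman_pilot_bandwidth(data)    # pilot bandwidth (Gaussian reference)
    for _ in range(n_iter):
        # Smoothing factors: Fourier transform of the Gaussian kernel of width h
        b = a * np.exp(-0.5 * (k * scale * h) ** 2)
        # R(f'') of the smoothed truncated Fourier sum, mapped back to data units
        R_f2 = scale ** 5 * np.sum(k[1:] ** 4 * np.abs(b[1:]) ** 2) / np.pi
        h = amise_bandwidth(R_f2, N)       # AIMSE update of the bandwidth
    return h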

To test the performance of the proposed algorithm, the battery of test pdfs published by Marron and Wand (1992) is used. These are fifteen pdfs expressed as Gaussian mixtures that mimic pdfs frequently encountered in real-life situations. In numerical experiments it is seen that all of these pdfs can be recovered accurately from just a few low-frequency Fourier harmonics of the data, i.e., M << N/2 for large N. Furthermore, the fast algorithm handles data sets of size N = 150,000 easily, with computation times on the order of milliseconds on today's computers. In conclusion, the proposed fast polynomial-based KDE is a contribution towards improving the estimation of non-Gaussian pdfs from data, which may be employed for statistical process monitoring applications.

References:

Goertzel, G. (1958) An algorithm for the evaluation of finite trigonometric series. Amer. Math. Month., 65, 33-35.

Gottlieb, D. and C. Shu (1997) On the Gibbs phenomenon and its resolution. SIAM Rev., 39, 644-668.

Hardle, W., J. S. Marron and M. P. Wand (1990) Bandwidth choice for density derivatives. J. R. Statist. Soc. B, 52, 223-232.

Johnston, L. P. M. and M. A. Kramer (1994) Probability density estimation using elliptical basis functions. AIChE J., 40, 1639.

Jones, M. C., J. S. Marron and S. J. Sheather (1996) A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc., 91, 401-407.

Lopez-Negrete, R., S. C. Patwardhan and L. Biegler (2011) Constrained particle filter approach to approximate the arrival cost in moving horizon estimation. J. Proc. Control, 21, 909-919.

Marron, J. S. and M. P. Wand (1992) Exact mean integrated squared error. Ann. Statist., 20, 712-736.

Qin, S. J. (2003) Statistical process monitoring: basics and beyond. J. Chemometrics, 17, 480-502.

Safavi, A. A., J. Chen and J. A. Romagnoli (1997) Wavelet-based density estimation and application to process monitoring. AIChE J., 43, 1227-1241.

Silverman, B. W. (1982) Kernel density estimation using the fast Fourier transform. Appl. Statist., 31(1), 93-99.

Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. Chapman & Hall, New York, NY.

Ungarala, S. (2009) Computing arrival cost parameters in moving horizon estimation using sampling based filters. J. Proc. Control, 19, 1576-1588.