(64b) Challenges and Benefits of Aligning and Reconciling Process Data for Seasonal Processing Industries
AIChE Spring Meeting and Global Congress on Process Safety
2017
2017 Spring Meeting and 13th Global Congress on Process Safety
3rd Big Data Analytics
Big Data Analytics and Statistics I
Tuesday, March 28, 2017 - 8:30am to 9:00am
Over twenty years ago, the advantages of fully exploiting captured data were recognised, and mathematically optimal strategies for producing a reconciled and consistent data set were proposed by a number of authors. Typically these strategies were cast as constrained optimisation problems, where the intent was to produce a set of adjusted process measurements that satisfied known mass and energy balance constraints. While academic interest seems to have waned somewhat, such approaches are still relevant, and remain feasible even with the large data sets routinely generated today. However, simply constructing a reconciled data set is only a first step towards what is potentially possible.
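As a minimal sketch of the kind of constrained optimisation referred to above, the classical linear case can be written as a weighted least-squares problem: adjust the measurements as little as possible, relative to their uncertainty, so that the balances close exactly. The function name `reconcile`, the three-stream splitter, and all flow values and standard deviations below are hypothetical illustrations and are not taken from the plants described in this work.

```python
import numpy as np

def reconcile(y, sigma, A):
    """Weighted least-squares reconciliation subject to A @ x = 0.

    y     : raw measurements
    sigma : assumed measurement standard deviations (independent errors)
    A     : linear balance constraints, one row per balance
    """
    y = np.asarray(y, dtype=float)
    V = np.diag(np.asarray(sigma, dtype=float) ** 2)  # diagonal covariance (assumption)
    A = np.atleast_2d(np.asarray(A, dtype=float))
    residual = A @ y  # how far the raw data are from closing the balances
    # Closed-form solution of: min (x - y)' V^-1 (x - y)  s.t.  A x = 0
    correction = V @ A.T @ np.linalg.solve(A @ V @ A.T, residual)
    return y - correction

# Hypothetical mass balance around a splitter: stream 1 = stream 2 + stream 3
A = [[1.0, -1.0, -1.0]]
y = [100.0, 61.0, 42.0]   # raw flows, which do not balance (100 != 103)
sigma = [2.0, 1.0, 1.0]   # stream 1 is assumed to be the least reliable meter

x_hat = reconcile(y, sigma, A)
print(x_hat)  # adjusted flows that satisfy the balance
```

The least trusted measurement absorbs the largest share of the adjustment. Nonlinear energy balances or gross-error detection would need more than this closed form.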
This work describes the construction of a reconciled and aligned database of three separate large-scale milk powder plants in New Zealand. The production of milk powder has some special challenges compared to, say, bulk chemicals. First, given that the raw material (from cows) is a biological product of eating grass in different geographical locations, there are many uncontrolled and strongly correlated seasonal variables. Second, milk production in New Zealand is largely seasonal: plants typically operate for only 9 months of the year, with an annual 3 month shutdown used for sometimes significant plant upgrades or changes. Consequently, data captured in one season could be very different to the next, due to seemingly small plant changes. Finally, the end-properties of the milk powder are subtle, challenging to measure repeatably, and can be hard to relate to production conditions due to the complex multivariate nature of the underlying physical-chemical relations.
The motivation behind the creation of this database is to investigate the potential of "Real Time Quality" in the production of Instant Whole Milk Powder (IWMP). Increasingly across the dairy industry, the focus of producers is shifting from maximising production to maximising quality, and towards higher-value milk powders and premium products, which necessarily also have higher requirements in terms of performance and composition. This covers a very broad range of attributes, including dozens of measurements each of physical, functional, sensory, and microbiological factors, not all of which are explicitly controlled at the time of manufacture. Functional properties, such as the dissolution behaviour, taste, or texture, are inherently qualitative, which makes them challenging to control and quantify accurately, and they are often not tested regularly. The determining physical causes are not always well known, and plant operators may rely on rules of thumb, or may not even have a chance to affect functional property outcomes if the test results are not timely. It would therefore be advantageous for future quality control to be performed in real time, to prevent many tonnes of off-specification powder being produced before detection by the infrequent and delayed offline measurements.
It was hypothesised that multivariate regression utilising a very broad and deep dataset, containing many process and quality variables from several different powder plants producing the same premium product across several production seasons, may give sufficiently varied input data to allow prediction of powder functional properties suitable for real time decision making. This avoids the problem that raw data from a plant operating at steady state contains too little excitation to build meaningful models. The first requirement was construction of the dataset, which was a vast undertaking to combine and align many sources of data, spread over time, geography, and changes in plant design and operating methods.
This approach distinguishes between three distinct types of measured data: X data, which comprise the standard process measurements such as temperature, pressure, and flow; Y data, which comprise at-line hourly measurements of in-process powder physical properties; and Z data, which are the final powder functional properties. The X data are typical measurements that exist in any process plant, whereas the Y data are key physical powder measurements, such as fat, protein, and moisture content. The Z data are typically prescribed by customers or regulatory bodies, and may be challenging to change or improve. The Y data, however, may be expanded with newly introduced measurements, such as bulk density or particle size distribution, which it is hypothesised will improve prediction of functional properties by bridging the gap between measured process data and final powder properties.
These three data groups are stored in different databases, and not always with clear methods to cross-reference them. A further complication is the powder transport chain: the actual production time of any Z-data quality sample taken from packed powder bags cannot be confidently known, due to holdup time and mixing of powder in large blending bins and silos.
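One simple way to handle this uncertainty, shown here only as a hedged sketch and not as the alignment method used in this work, is to shift each Z sample back by an assumed nominal holdup time and average the process data over a window around that estimate, so that blending is reflected in the matched values. The column names `dryer_inlet_temp` and `wettability_s`, the 6 hour holdup, the 1 hour spread, and all data values are hypothetical.

```python
import pandas as pd

# Hypothetical hourly X data (process measurement), indexed by timestamp.
x = pd.DataFrame(
    {"dryer_inlet_temp": [200.0, 202.0, 204.0, 206.0, 208.0, 210.0]},
    index=pd.date_range("2016-10-01 00:00", periods=6, freq=pd.Timedelta(hours=1)),
)

# Hypothetical Z data: one functional test on a bag sampled at packing.
z = pd.DataFrame({
    "sample_time": pd.to_datetime(["2016-10-01 09:00"]),
    "wettability_s": [14.0],
})

HOLDUP = pd.Timedelta(hours=6)  # assumed nominal silo/bin holdup
SPREAD = pd.Timedelta(hours=1)  # assumed half-width of the blending window

def window_mean(sample_time):
    """Mean of X over a window centred on the estimated production time."""
    centre = sample_time - HOLDUP
    return x.loc[centre - SPREAD : centre + SPREAD].mean()

# One row per Z sample, with window-averaged X columns attached.
aligned = z.join(z["sample_time"].apply(window_mean))
print(aligned)
```

In practice the holdup and spread would differ by plant, silo and production rate, so fixed constants like these are a starting point at best.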
Construction of the database required close examination of the specific operating and sampling processes at each plant, and their associated data storage methods. This uncovered a large number of special cases in the data streams, each of which must be detected and cleaned up. Doing so requires time-consuming manual programming and is not easily generalised to other plants. This is, however, a fundamental feature of many types of industrial data, and cleaning up data idiosyncrasies is a key requirement for creating a useful dataset, one which would greatly benefit from the development of advanced or smart data processing methods. It is also, in our experience, a key reason why properly aligned data is so rarely constructed.
Missing data is a further key challenge to surmount, as the routine plant cleaning cycles cause a significant fraction of plant data to be out of range at all times. For example, in a plant with four parallel milk evaporator trains, one or more are always out of service for cleaning. The process temperatures recorded for those trains are either the ambient temperature or that of the cleaning fluids, and must be removed from the data set when comparing process conditions. Missing data can be imputed, which may have an effect on the behaviours observed, or alternatively samples with any missing data in their row can be dropped. However, dropping entire rows causes severe data reduction at times, up to 100% of the data in plants with parallel unit operations where one is always offline, and clearly such an approach is untenable. Consequently other methods have been proposed, such as creating composite variables from parallel unit operations, with some success.
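The contrast between row dropping and composite variables can be illustrated with a small sketch. The column names, temperature values, and the 60 to 75 degree in-service window below are hypothetical, and the mean across online trains is only one possible composite, not necessarily the one used in this work.

```python
import pandas as pd

# Hypothetical outlet temperatures (deg C) for four parallel evaporator trains.
# Values near 20 mimic ambient (offline); values near 80 mimic cleaning fluid.
raw = pd.DataFrame({
    "evap1_temp": [68.0, 68.5, 20.0, 21.0],
    "evap2_temp": [80.0, 67.9, 68.2, 68.4],
    "evap3_temp": [67.5, 19.5, 67.8, 68.1],
    "evap4_temp": [21.0, 68.3, 68.0, 80.0],
})

LOW, HIGH = 60.0, 75.0  # assumed in-service temperature window

# Mark out-of-range readings as missing rather than keeping misleading values.
masked = raw.where((raw >= LOW) & (raw <= HIGH))

# Option 1: keep only rows with every train in service.
complete_cases = masked.dropna()

# Option 2: composite variables built from whichever trains are online.
composite = pd.DataFrame({
    "evap_temp_mean": masked.mean(axis=1),            # NaN readings are skipped
    "evap_trains_online": masked.notna().sum(axis=1),
})

print(len(complete_cases), "of", len(raw), "rows survive row dropping")
print(composite)
```

Because at least one train is offline in every row of this example, row dropping discards everything, while the composite keeps all four rows and records how many trains contributed to each value.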
Carefully framing the questions to be asked of the data is a key factor in preventing unnecessary data reduction, by informing the optimal data arrangement or imputation. This planning was also vital in selecting which parts of the data to use for predictive modelling. Changes in plant equipment or operating methods cause structural changes in the data set and change or break underlying relationships. For example, the full data set can be used to find similarities and differences between the three plants producing the same high-value product. However, there are significant differences in plant design between them, and they do not share the same relationships between process conditions and product quality. Product quality is much better investigated using data from only one plant at a time, which produces models that may be used to inform process conditions particular to that plant.
While significant work has been undertaken in prediction of functional properties, there is also great value produced solely by acquiring and organising this dataset, which had not previously been undertaken due to the difficulty and effort required. Visualisation of this plant data using novel methods or arrangements can elucidate many behaviours, differences and similarities between the different plants, products, and times of the season. Data visualisation is a rich subject with great value, which in our experience is rarely well explored or exploited by plant process control engineers. To generate information from rich visualisation, however, the appropriate dataset must first be collected and aligned, which may be a prohibitive undertaking without careful planning, design and intelligent processing methods.
The paper will illustrate uses of this data set, showing visualisation strategies for time-varying multivariate models, and how plant engineers can quickly assess the current plant status compared to past runs, as well as work in finding predictive models for use in Real Time Quality production.