(173am) A Novel Application of the Newcomb-Benford Law to Exposure Data | AIChE


The Newcomb-Benford Law (NBL) is a mathematical concept that describes the frequency distribution of the leading significant digits produced by many stochastic processes (Li, Fu, and Yuan 2015). It was first noted by Simon Newcomb in 1881 (Newcomb 1881) and later revived by Frank Albert Benford in 1938 (Benford 1938), who applied the concept to a large sample of data sets. The NBL states that the probabilities of the significant digits (1 through 9) of a randomly occurring number are not equally distributed (Equation 1), as one might expect by chance: if every digit were equally probable, each of the nine digits would have a relative frequency of about 11.1%. Instead, the NBL describes a distribution in which 1 has a higher relative frequency than 2, and so on.
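The first-digit probabilities referenced as Equation 1 take the standard Benford form P(d) = log10(1 + 1/d); a minimal Python sketch shows how far this departs from the uniform 11.1%:

```python
import math

def benford_prob(d: int) -> float:
    """Probability under the NBL that the leading significant digit is d (1-9)."""
    return math.log10(1 + 1 / d)

# Digit 1 is expected ~30.1% of the time; digit 9 only ~4.6%,
# versus the 11.1% each digit would receive under a uniform distribution.
expected = {d: round(benford_prob(d), 3) for d in range(1, 10)}
print(expected)
```

The nine probabilities sum to exactly 1, since the terms telescope to log10(10).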

The NBL has found widespread application across many fields, even though the underlying reasons for its prevalence remain somewhat of a mystery (Li, Fu, and Yuan 2015). In particular, the NBL has been widely used to identify fraudulent or manipulated data in the political (voter fraud; Pericchi and Torres 2011), economic (tax evasion; Demir and Javorcik 2020), and natural (data manipulation; Hullemann, Schupfer, and Mauch 2017) sciences. However, few studies have used the NBL in the exposure sciences. Furthermore, to the best of the authors' knowledge, the NBL has very rarely been applied to detect the presence of underlying driving forces, such as homeostasis or hormesis, rather than to identify data manipulation.

As described by Benford and pointed out by Brown (2005), some data will inherently violate the distribution, namely: (1) data that are confined by upper or lower limits, (2) data from experimental designs that allow for only a few discrete values, and (3) data from systems containing processes that regulate or drive values toward an equilibrium. Most fields have used the law to detect manipulated data, and this practice could result in falsely labeling data sets as manipulated when in actuality they are driven by underlying mechanisms, such as toxicokinetic and toxicodynamic (TK-TD) factors or seasonal influences, as suggested by point (3).

We applied the NBL to five different exposure data sets (Table 1) with the aim of understanding how the TK-TD behavior of essential, beneficial, and non-essential elements may influence conformity to the Newcomb-Benford distribution (NBD). We hypothesize that essential and beneficial elements will violate the NBD due to underlying physiological mechanisms that favor uptake and retention under normal conditions (observational studies), while non-essential elements will follow the NBD. We further hypothesize that, in experimental data sets (MICE) involving administration of elements, the data will violate the NBD because dosing studies artificially perturb the normal balance.

Data sets used in the present investigation were generated from various projects (Pino et al. 2012, 2018; Alimonti et al. n.d.; National Statistics Office of Georgia 2019; Ruggieri, Alimonti, and Bocca 2016). Measurements were based on Sector Field Inductively Coupled Plasma Mass Spectrometry (SF-ICP-MS) and followed the methods of Ruggieri, Alimonti, and Bocca (2016).

Data were processed using the 'BenfordTests' package in R (v1.2.0; Joenssen 2015). Preliminary analyses applied the various methods within the 'BenfordTests' package to the adult PROBE (PROBE_1) data set to determine the most suitable test for comparisons. Tests assessed included: Pearson's Chi-squared Goodness-of-Fit Test, Euclidean Distance Test, Hotelling T-square Type Test, Joenssen's JP-square Test, Kolmogorov-Smirnov Test, Chebyshev Distance Test (maximum norm), Judge-Schechter Mean Deviation Test, and Freedman-Watson U-square Test.
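As an illustration of the simplest of these tests, a Pearson chi-squared goodness-of-fit statistic against the NBD can be sketched in Python (a simplified stand-in for the R package's implementation, not the code used in the study):

```python
import math

def benford_chisq(counts):
    """Pearson chi-squared statistic comparing observed first-digit counts
    (counts[0] for digit 1 ... counts[8] for digit 9) to NBD expectations.
    Under the null, the statistic is approximately chi-squared with 8 df."""
    n = sum(counts)
    stat = 0.0
    for d, obs in enumerate(counts, start=1):
        exp = n * math.log10(1 + 1 / d)  # expected count for digit d
        stat += (obs - exp) ** 2 / exp
    return stat

# Counts drawn in near-perfect Benford proportions give a statistic near zero.
ideal = [round(1000 * math.log10(1 + 1 / d)) for d in range(1, 10)]
print(benford_chisq(ideal))
```

A large statistic relative to the chi-squared(8) distribution indicates a violation of the NBD.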

Findings from the test comparisons suggest that most (7 of 8) tests are highly sensitive to small deviations from the NBD; all but one test indicated violations of the NBD (p < 0.00001). The least sensitive test, the Judge-Schechter Mean Deviation Test (Judge and Schechter 2009), was therefore selected to allow for quantifiable comparisons of violations between data sets.

All data were analyzed using the Judge-Schechter Mean Deviation Test for Benford's Law. This method reduces data points to a specified number of significant digits and then performs a goodness-of-fit test between the mean deviation of the observed first-digit distribution and the theoretical NBD. The test assumes that values are continuous and that the number of significant digits analyzed is not influenced by previous rounding. Further information can be found in the 'BenfordTests' package documentation.
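The idea behind a mean-deviation statistic can be sketched as follows; this is a simplified illustration of the concept, not the 'BenfordTests' implementation, which also supplies an asymptotic p-value:

```python
import math

# Expected mean first digit under the NBD: sum of d * log10(1 + 1/d), ~3.44.
BENFORD_MEAN = sum(d * math.log10(1 + 1 / d) for d in range(1, 10))

def first_digit(x):
    """First significant digit of a nonzero number."""
    x = abs(x)
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)

def mean_digit_deviation(data):
    """Absolute deviation of the observed mean first digit from the NBD mean;
    larger values indicate a stronger departure from Benford behavior."""
    digits = [first_digit(v) for v in data if v > 0]
    return abs(sum(digits) / len(digits) - BENFORD_MEAN)
```

Data conforming to the NBD yield a deviation near zero, while regulated or manipulated data shift the mean first digit away from approximately 3.44.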

The Judge-Schechter Mean Deviation Test requires that data are not confined by previous rounding; therefore, measurements at or below the limit of detection (LoD) for each element and analytical method were removed from the data sets before analysis. Mean digit tests were not confined by digit-number restraints and were performed using asymptotic analysis, as defined by the package. Data visualizations are provided as relative frequencies with 95% confidence intervals. Data sets for PROBE_1, PROBE_2, BURLO, and UNICEF were analyzed both in three groups split by the elemental classifications defined in Table 1 and all together (All). According to the hypothesis, when all data points for all measured elements are combined, they should follow the NBD; once divided into their respective groupings, the essential, beneficial, and non-essential data should violate, potentially violate, and follow the NBD, respectively.
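The LoD pre-filtering step can be sketched as below; the element symbols and LoD values here are hypothetical placeholders, not those of the actual data sets:

```python
# Remove measurements at or below the limit of detection (LoD) before
# digit analysis. All element symbols and LoD values are hypothetical.
measurements = {"Zn": [512.0, 0.04, 87.3], "Pb": [0.9, 0.01, 2.2]}
lod = {"Zn": 0.05, "Pb": 0.02}  # hypothetical detection limits

filtered = {
    el: [v for v in vals if v > lod[el]]
    for el, vals in measurements.items()
}
print(filtered)  # {'Zn': [512.0, 87.3], 'Pb': [0.9, 2.2]}
```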

Across the data sets, blood appeared to be the most reliable matrix for assessing the current hypotheses. Urine and hair samples did not appear to accord with the hypothesis; however, more data sets are required to confirm this (data not shown). This outcome could be due to the less dynamic nature of these sample types compared to blood. After consideration of the biology of the sample types, more thorough analyses were conducted for blood (PROBE_1, PROBE_2, UNICEF). In the PROBE_1 (adult) data, the All and Non-essential groupings closely reflected the NBD (p = 0.202 and p = 0.605, respectively). In contrast, the essential and beneficial elements violated the NBD, as expected (p < 0.00001 for both). However, there appeared to be a non-random frequency distribution in the Essential and Beneficial results: the trend for these two sets followed a rotated sigmoidal curve, in which the first digit of 1 had a relatively high frequency, followed by a drop for digits 2 to 5, and ending with a slight increase and levelling off of frequencies from 6 to 9.

The PROBE_2 (adolescent) Essential data demonstrated a similar rotated sigmoidal curve and violation of the NBD (p = 0.000987). In contrast to the PROBE_1 data, the remaining groupings of All, Non-essential, and Beneficial did not appear to demonstrate either random or fully Benford distributions (p < 0.00001 for each). For the UNICEF data, the All grouping closely followed the NBD (p = 0.945), the Beneficial data depicted the previously described rotated sigmoidal curve (p = 0.00366), and the Non-essential data appeared to follow neither the NBD nor another predictable trend (p < 0.00001). Differences in metabolic activity and energy demands between growing children/adolescents and adults could explain the differences among these three data sets, particularly the unclear conformance of the non-essential elements in the children/adolescents. Other differences could be due to the small sample sizes and the uneven numbers of elements between the groups.

When adherence to the NBD was compared within the blood biomonitoring data and between data sets (PROBE_1, PROBE_2, and UNICEF), two of three data sets followed the NBD for non-essential elements, as expected under the hypotheses. For the beneficial elements, adherence was inconsistent: each data set depicted a differently shaped curve. However, both essential element groups analyzed (PROBE_1 and PROBE_2 only) demonstrated what appeared to be a characteristic rotated sigmoidal curve, which could be indicative of an underlying mechanism such as active uptake of an element. This type of visual inspection could help distinguish deliberately manipulated data from data shaped by an underlying influencing mechanism. Previous research has treated any violation of the NBD as indicative of data manipulation; using NBL violations to flag manipulated data in this way may be misleading, and requires consideration of whether the data in question are subject to an underlying driving force in the study system or subject.