Thursday, March 26, 2009

Sorting out junk statistics

It takes a long time to sort out junk statistics. What will kill Matheronian geostatistics is that its proponents are drifting into the study of climate change. That’s why I've put together a sort of standardized rant. It pays to work on various ISO Technical Committees! Here’s what I like to tell those who abuse statistics with reckless abandon...

... My story is about what was once hailed as Matheron's new science of geostatistics. Matheronian geostatistics plays a role not only in reserve and resource estimation for the world’s mining industry but even more so in the study of climate change. Real statistics turned into geostatistics under the leadership of Professor Dr Georges Matheron, a French geologist and a self-made wizard of odd statistics. My 20-year battle against the geostatocracy and its army of degrees of freedom fighters is chronicled on my website. I have brought my concerns to the attention of the Federal Government of Canada, the Provincial Government of British Columbia, the Ontario Securities Commission, the US Securities and Exchange Commission, and the US Senate Committee on Transportation, Science, & Technology.

Dr Frederik P Agterberg, Past President, International Association for Mathematical Geosciences, called Professor Dr Georges Matheron (1930-2000) the Creator of Spatial Statistics. Agterberg ranked him on a par with giants of mathematical statistics such as Sir Ronald A Fisher (1890-1962) and Professor Dr J W Tukey (1915-2000). Agterberg was wrong! Matheron failed to derive the variance of his length-weighted average in 1954 and in 1960.

Agterberg's distance-weighted average point grade

Agterberg himself failed to derive the variance of his distance-weighted average in his 1970 Autocorrelation Functions in Geology and again in his 1974 Geomathematics. Agterberg’s problem is that as few as a pair of measured values, determined in samples selected at positions with different coordinates in a finite sampling unit or sample space, gives an infinite set of zero-dimensional, variance-deprived distance-weighted average point grades. Infinite sets of kriged estimates and zero kriging variances are the very reasons why the world's mining industry welcomed geostatistics with reckless abandon. Geostatistics converted Bre-X’s bogus grades and Busang’s barren rock into a massive phantom gold resource. I applied analysis of variance and proved the intrinsic variance of Busang's gold to be statistically identical to zero. How many mineral inventories in annual reports are bound to shrink during mining?

Lord Kelvin (William Thomson 1824-1907) once said, “…when you can measure what you are speaking about, and express it in numbers, you know something about it, but when you cannot express it in numbers your knowledge is of the meagre and unsatisfactory kind…” Lord Kelvin knew more about degrees Kelvin and degrees Celsius than about degrees of freedom. Lord Kelvin and Sir Ronald A Fisher (1890-1962) were marginal contemporaries. Lord Kelvin would have wondered about the wisdom behind assumed spatial dependence between measured values in ordered sets. Sir Ronald A Fisher could have verified spatial dependence by applying his F-test to the variance of a set of measured values and the first variance term of the ordered set.

Not all scientists need to know as much about Fisher's F-test as do geoscientists. All too few know how to verify spatial dependence by applying Fisher’s F-test, and how to derive sampling variograms that show where orderliness in our own sample space of time dissipates into randomness. So much concern about climate change! So little concern about sound sampling practices and proven statistical methods! I make a clear and concise case against geostatistics on my blog and on my website. Surely, sound sampling practices and proven statistical methods ought to be taught at all universities on this planet and be implemented in all international standards. I’m working hard to make it happen. What will ... do about it?

Sunday, March 15, 2009

Statistics for geoscientists

What struck me as odd is that spatial dependence between measured values in ordered sets may be assumed. Incredibly, it was Stanford’s own Journel who put forward in the early 1990s that spatial dependence may be assumed without proof. What’s more, he deemed my reading “too encumbered with Fischerian (sic) statistics”. Much to my surprise, JMG's Editor didn't agree with Journel but didn't disagree enough to pose questions.

Stanford’s Journel was Matheron’s most gifted disciple. So, he was bound to take a shine to his master’s voice. It explains why he accepted false variances and rejected true variances. All the same, IAMG’s mission points to real statistics. Agterberg ought to explain why his distance-weighted average point grade lost its variance some 40 years ago. And he ought to explain why it is too late to reunite his distance-weighted average with its long-lost variance. What I want to do is show geoscientists how to apply Fisher’s test and verify spatial dependence between measured values in ordered sets, how to count degrees of freedom, and how to derive the statistics for a set.
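Here is a minimal sketch of that test in Python. It rests on an assumption about the arithmetic: the first variance term of the ordered set is taken to be the sum of squared differences between successive values divided by 2(n-1). The function names and the sample values are hypothetical and chosen only for illustration. The observed F-ratio still has to be compared with a tabulated F-value for the stated degrees of freedom at the chosen probability level.

```python
from statistics import variance


def first_variance_term(values):
    """First variance term of the ordered set (assumed definition):
    sum of squared successive differences divided by 2(n-1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measured values")
    squared = [(b - a) ** 2 for a, b in zip(values, values[1:])]
    return sum(squared) / (2 * (n - 1))


def f_ratio(values):
    """Return (F, df of the set, df of the ordered set).

    F is the variance of the set divided by the first variance term
    of the ordered set.
    """
    n = len(values)
    var_set = variance(values)                 # df = n - 1
    var_first = first_variance_term(values)    # df = 2(n - 1)
    if var_first == 0:
        raise ValueError("first variance term is zero; F is undefined")
    return var_set / var_first, n - 1, 2 * (n - 1)


# Made-up ordered set with an obvious trend, not real data.
ordered = [1.0, 2.0, 3.0, 4.0, 5.0]
f_value, df_set, df_first = f_ratio(ordered)
print(f"F = {f_value:.2f} with df = {df_set} and {df_first}")
```

An F-ratio well above the tabulated value would point to a significant degree of spatial dependence in the ordered set. A ratio below it would mean that the two variances are statistically identical.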

I applied for and was granted permission to access Environment Canada’s massive database for monthly temperatures by location. I downloaded several sets of monthly temperatures for a few interesting locations. What turned out to be a gem to work with was the set of monthly temperatures for Coral Harbour, Nunavut, for the period from 1933 to 2007. This time, I didn’t test whether or not monthly variances constitute a homogeneous set. I may apply Bartlett's chi-square test at some later stage. The first stage of the statistical analysis is to derive and plot the ordered set of annual means in a chart.

Plotting trend lines is popular but deriving sampling variograms makes statistical sense. Excel spreadsheet templates are the most effective show-and-tell tools. Working with Riemann sums and deriving variance terms of ordered sets is straightforward in spreadsheet templates.

The above sampling variogram shows that the first variance term of the ordered set is lower than the variance of the set and higher than the lower limit of the asymmetric 95% confidence range. Hence, the variance of the set and the first variance term of the ordered set are statistically identical. In fact, each of the variance terms is statistically identical to the variance of the set. A significant degree of spatial dependence would dissipate into randomness. It should not reappear without a rational reason such as running out of degrees of freedom. Too few degrees of freedom would give atypical sampling variograms.
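The variance terms behind such a sampling variogram can be sketched in a few lines of Python rather than in a spreadsheet. This is an illustration under an assumption: the jth variance term is taken to be the sum of squared differences between values j positions apart, divided by 2(n-j). The function name and the short series are made up and are not the Coral Harbour data.

```python
def variance_terms(values, max_lag=None):
    """Return a list of (lag, variance term, degrees of freedom).

    Assumed definition: the term for lag j is the sum of squared
    differences between values j positions apart, divided by 2(n - j).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measured values")
    if max_lag is None or max_lag > n - 1:
        max_lag = n - 1
    terms = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag
        sum_sq = sum((values[i + lag] - values[i]) ** 2 for i in range(pairs))
        terms.append((lag, sum_sq / (2 * pairs), 2 * pairs))
    return terms


# Made-up ordered set of six values, for illustration only.
series = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
terms = variance_terms(series, max_lag=3)
for lag, var_j, df_j in terms:
    print(f"lag {lag}: var = {var_j:.2f}, df = {df_j}")
```

Plotting each term against its lag, together with the variance of the set and its confidence limits, gives the kind of chart described above. The last few terms rest on very few degrees of freedom and should be read with caution.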

The central value of -11.58 centigrade for this period has a symmetric 95% confidence range (95% CR) with a lower limit of 95% CRL=-11.85 centigrade and an upper limit of 95% CRU=-11.31 centigrade. The observed absolute difference of O|dx|=1.83 centigrade between xbar(1933)=-13.08 and xbar(2007)=-11.25 is below the expected absolute difference of E0.05;|dx|=2.43 at 95% probability. Hence, annual temperatures during this period of 75 years did not vary significantly.

The question is whether or not the lowest and the highest annual means differ significantly. The observed absolute difference of O|dx|=7.04 centigrade between the lowest annual mean of xbar(1972)=-15.25 centigrade and the highest of xbar(2006)=-8.21 centigrade exceeds E0.001;|dx|=3.61 by a wide margin. Hence, xbar(2006)=-8.21 is significantly higher than xbar(1972)=-15.25. The probability that this statistical inference is true exceeds 99.9%. The annual mean of xbar(2006)=-8.21 centigrade was measured in the Canadian Arctic during the famous hockey stick year. The following year, the annual mean had cooled down to xbar(2007)=-11.25 centigrade.
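The two comparisons above come down to simple arithmetic, which the following Python sketch retraces. The annual means and the expected absolute differences are the figures quoted in this post. The expected differences are taken as given and are not recomputed here. The function name is hypothetical.

```python
def compare_means(mean_a, mean_b, expected_difference):
    """Return the observed absolute difference and whether it exceeds
    the expected absolute difference at the stated probability."""
    observed = abs(mean_a - mean_b)
    return observed, observed > expected_difference


# xbar(1933) against xbar(2007), expected difference at 95% probability
first_last, first_last_significant = compare_means(-13.08, -11.25, 2.43)

# xbar(1972) against xbar(2006), expected difference at 99.9% probability
low_high, low_high_significant = compare_means(-15.25, -8.21, 3.61)

print(f"1933 vs 2007: {first_last:.2f}, significant: {first_last_significant}")
print(f"1972 vs 2006: {low_high:.2f}, significant: {low_high_significant}")
```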

A set of n annual means gives df=n-1 degrees of freedom whereas the ordered set gives df=2(n-1) degrees of freedom. The symbol for the variance of a set is var(x). The symbol for the variance of the jth term of the ordered set is varj(x). The second term has but 2(n-2) degrees of freedom because the last but one annual mean is no longer used. This is why each next term has two fewer degrees of freedom than the previous one. Geoscientists ought to take counting degrees of freedom more seriously than geostatisticians do.
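In symbols, the counting rule can be written as follows. The formula for the jth variance term is an assumed reading that is consistent with the degrees of freedom stated above. It is a sketch and not a quotation from any standard.

```latex
\begin{align*}
\operatorname{var}(x) &= \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2,
  & df &= n-1 \\
\operatorname{var}_j(x) &= \frac{1}{2(n-j)}\sum_{i=1}^{n-j}\left(x_{i+j}-x_i\right)^2,
  & df_j &= 2(n-j)
\end{align*}
```

For the n=75 annual means from 1933 to 2007, this gives df=74 for the set, df=148 for the first term of the ordered set and df=146 for the second.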