Monday, July 21, 2008

I have a Theory

By Roderick Flores

It was with great curiosity that I read Chris Anderson's article on the end of theory. To summarize his position, the "hypothesize, model, and test" approach to science has become obsolete now that there are petabytes of information and countless computers capable of processing that data. Further, this data tsunami has made the search for models of real-world phenomena pointless because "correlation is enough."

The first thing that struck me as ironic about this argument is that statistical correlation is itself a model, with all of the simplifications and assumptions that entails. Just how do I assign a measure of similarity between a set of objects without having a mathematical representation (i.e., a model) of those things? How might I handle strong negative correlation in this analysis? What about the null hypothesis? While not interesting, per se, it is useful information. Will a particular measurement be allowed to correlate with more than a single result cluster?
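
To make that concrete, here is a minimal sketch (Python with NumPy, using synthetic data I invented purely for illustration) of Pearson's correlation coefficient written out by hand. Even this simplest measure of correlation is a model: it assumes a linear relationship and finite variances, it reports informative anti-correlation that a naive notion of "similarity" would miss, and it is blind to dependence that falls outside its assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    linear = 2.0 * x + rng.normal(scale=0.1, size=x.size)     # linear dependence
    inverse = -2.0 * x + rng.normal(scale=0.1, size=x.size)   # strong anti-correlation
    parabola = x ** 2                                         # dependent, but not linearly

    def pearson(a, b):
        # Pearson's r, written out: a model that assumes linearity and finite variance.
        a, b = a - a.mean(), b - b.mean()
        return (a @ b) / np.sqrt((a @ a) * (b @ b))

    print(pearson(x, linear))    # ~ +1.0: the linear model fits
    print(pearson(x, inverse))   # ~ -1.0: anti-correlation is information, not dissimilarity
    print(pearson(x, parabola))  # ~  0.0: real dependence the linear model cannot see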

Additionally, we must decide how to group these petabytes of measurements into correlated clusters. As before, the statistics used to calculate correlation are also models. Are we considering Gaussian distributions, scale-invariant power laws, or perhaps a state-driven sense of probability? Are we talking about events with a well-defined frequency, such as the toss of a coin, or, more likely, about subjective plausibility? You need to be very cautious when choosing your statistical model. For example, using a bell curve to describe heavy-tailed data destroys any real sense of correlation.
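
As a hedged illustration of that last point (again Python, again synthetic data of my own construction): a moment-based, bell-curve-minded statistic such as Pearson's r lets a handful of power-law-sized extremes dictate the answer, while a rank-based statistic such as Spearman's rho reflects the bulk of the data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = x + rng.normal(scale=0.2, size=1000)   # the bulk of the data is clearly related

    # Inject five unrelated, heavy-tailed spikes of the sort a power-law process
    # produces (Pareto draws, offset so every spike dwarfs the bulk of the data).
    x[:5] = (1.0 + rng.pareto(a=1.0, size=5)) * 100.0
    y[:5] = -(1.0 + rng.pareto(a=1.0, size=5)) * 100.0

    print(stats.pearsonr(x, y)[0])   # strongly negative: five points rule the statistic
    print(stats.spearmanr(x, y)[0])  # still strongly positive: ranks see the bulk relationship

Nothing here says the rank-based statistic is the "right" model either; the point is only that the choice is a modeling decision, not something the data make for you.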

Regardless of how you statistically model your measurements, you must understand your data, or your correlations may not make sense. For example, imagine that I have two acoustic time series. How do I measure the correlation of these two recordings to determine how well they are related? The standard approach is to cross-correlate the two signals (convolve one with the time-reverse of the other) and look for a value that indicates "significant correlation," whatever your model for that turns out to be. Yet this doesn't mean much unless I understand my data. Were the two time series recorded at the same sampling rate? If I have 20 samples of a 10 Hz sine wave recorded at 100 samples per second, they will look exactly the same as 20 samples of a 5 Hz sine wave recorded at 50 samples per second. If I naively compare the samples, they will correlate perfectly. In other words, if I don't understand my data, I can easily, and erroneously, report that the correlation of the two signals is perfect when in fact the underlying signals have zero correlation.
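
Here is that trap in a few lines (Python/NumPy; the normalized cross-correlation helper is my own illustrative stand-in for whatever correlation measure you prefer):

    import numpy as np

    n = np.arange(20)
    ten_hz_at_100sps = np.sin(2 * np.pi * 10.0 * n / 100.0)   # sampled at t = n/100 s
    five_hz_at_50sps = np.sin(2 * np.pi * 5.0 * n / 50.0)     # sampled at t = n/50 s

    print(np.allclose(ten_hz_at_100sps, five_hz_at_50sps))    # True: sample-for-sample identical

    def normalized_xcorr_peak(a, b):
        # Peak of the normalized cross-correlation of two zero-meaned traces.
        a, b = a - a.mean(), b - b.mean()
        xc = np.correlate(a, b, mode="full")
        return xc.max() / np.sqrt((a @ a) * (b @ b))

    # Reports a "perfect" match even though one trace is a 10 Hz wave and the other 5 Hz.
    print(normalized_xcorr_peak(ten_hz_at_100sps, five_hz_at_50sps))   # ~ 1.0

Only the sample values went into that number; the sampling rates, which are the data about the data, never did.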

Finally, what I find most intriguing is the presumption that the successful correlation of petabytes of data culled from web pages and their associated viewing habits somehow generalizes into a method for science as a whole. Unlike the "as-seen-on-TV" products in infomercials, statistical inference is not the only tool I will ever need. Restricting ourselves to correlation removes one of the most powerful tools we have: prediction. Without it, scientific discovery would be hobbled.

Consider the correlation of all of the observed information regarding plate-boundary movement (through some model of the earth) along a fault such as the San Andreas. Keep in mind that enormous amounts of data are collected in this region. A quiet area along the fault would either imply that that piece of the fault is no longer seismically active or, using anti-correlation, that the "slip deficit" makes a much larger earthquake more likely to occur there in the future (such areas are referred to as seismic gaps).

Moreover, the Parkfield segment of the San Andreas fault has large earthquakes approximately every twenty years. A correlative model would suggest that the entire plate boundary should behave similarly, which is simply not true, as the Anza Seismic Gap demonstrates. Correlation would also have implied that another large event should have occurred along the Parkfield Gap in the late 1980s. If science were only concerned with correlation, one instrument in this zone would have been sufficient. Instead, the diverse set of predictions made by researchers demanded a wide variety of experiments, and the zone became the most heavily instrumented area in the world in an effort to study the expected large event in detail. Researchers had to wait over fifteen years for it to happen. Then there are events that few would have predicted (Black Swans), such as "slow" earthquakes, which require special instrumentation to capture. Until recently these phenomena could not be correlated with anything and so, by that logic, they never would have existed; in fact, one of the first observations of these events was attributed to instrument error.

Clearly, correlation is but one of many approaches to modeling processes. I have a theory that we in the grid community can expect to help scientists solve many different types of theoretical problems for a good long time. Now to test...