Tuesday, April 26, 2011

Lies, Damned Lies, and Heuristics

(Partly inspired by this; partly my ongoing ruminations. Warning: this is going to ramble.)

Social scientists are often accused of having "physics envy," meaning that they wish they could reduce complicated relationships to equations in the same way that physicists can determine μ for friction. If you run enough experiments, you can figure out exactly how much energy is lost when you slide rubber over wood. So, why can't we run a bunch of experiments and figure out what happens in the average election when unemployment is one percent higher, or what happens to a community when income inequality rises by 15 percent? What happens is, you get pathetically small correlations that you are forced to cling to because there is nothing better. 

Sports statisticians don't get physics envy, but they do get baseball envy. Sabermetricians have baseball so well digested that the average layperson can no longer contribute meaningfully. Because it is so much easier to isolate variables in baseball, observations and conclusions can be stacked one on top of another to create a mathematical chain of stunning complexity. Because all of the links are reasonable, the chain as a whole remains strong.

Yet even in baseball, the prediction systems have proven less than deadly accurate. Sure, PECOTA can get most teams pegged to within a few games of their actual record. But then again, you can guess "81-81" for every team and not be too far off the mark; if you add in a few obvious adjustments (Pirates and Royals 71-91; Yankees and Red Sox 91-71), you'll do even better. And this is in baseball, perhaps the easiest of sports to analyze statistically. 
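
To make that comparison concrete, here is a minimal sketch of how one might score a projection system against the "everyone goes 81-81" guess. The team names and win totals are made-up placeholders, not real PECOTA projections or real standings, and the function name is my own. The point is the method of comparison, not the outcome.

```python
def mean_absolute_error(predicted, actual):
    """Average number of wins by which the predictions miss."""
    return sum(abs(predicted[team] - actual[team]) for team in actual) / len(actual)

# Hypothetical data for illustration only: NOT real projections or results.
actual_wins = {"Team A": 95, "Team B": 88, "Team C": 81, "Team D": 74, "Team E": 67}
projected_wins = {"Team A": 91, "Team B": 84, "Team C": 83, "Team D": 78, "Team E": 72}

# The naive baseline: guess a .500 record (81 wins) for every team.
naive_wins = {team: 81 for team in actual_wins}

projection_error = mean_absolute_error(projected_wins, actual_wins)
naive_error = mean_absolute_error(naive_wins, actual_wins)

# Share of the naive baseline's error that the projection removes.
# 0 means no better than guessing 81-81; 1 means perfect.
skill = 1 - projection_error / naive_error

print(f"projection misses by {projection_error:.1f} wins on average")
print(f"81-81 guess misses by {naive_error:.1f} wins on average")
print(f"skill over baseline: {skill:.2f}")
```

Judging a system by how much it improves on the dumb guess, rather than by its raw error, is what keeps "within a few games" from sounding more impressive than it is.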

This isn't to knock PECOTA or to cast aspersions on the task that statisticians have chosen. I am skeptical, however, of efforts to distill statistics down to a single "omnistat" that can explain everything. The dangers aren't just that these omnistats will not be predictive--that's true--but also that they'll obscure everything that was useful about them in the first place.

I think most sports have passed through, or are passing through, three phases of stat gathering. The first phase is remarkably crude and only includes whatever is easiest to count. In baseball, that would be tallying home runs, runs batted in, errors, and the like. These stats tell us something, and sometimes tell us a lot, but there is much that they obscure. The second phase involves a search for new statistics that get at success in a new way: on-base percentage, for example. The third phase involves combining many of these statistics to create what I've termed "omnistats" above. These omnistats represent past success, but because they are more accurate reflections of the past, they are also relied upon for their predictive value.

This isn't a baseball-specific observation, though baseball supplies the most widely recognized examples, because most sports fans know the story thanks to Moneyball. Take a sport where the statistical revolution came even sooner than in baseball: horse racing. It's natural that horse racing would be at the vanguard of new statistical techniques, because it has long been one of the few legal betting options for sports fans.

The original horse racing statistics had some value, but not much. For example, you'd think that how quickly a horse runs would be very important, but it turns out that raw times are not particularly helpful for telling how well a horse ran (there's an old racing adage that "time only matters in prison"). Final times could be thrown off by how quickly the horses ran early in the race, how hard the track was, how close to the rail a horse stayed during the race, and much else.

That was phase one. In phase two, some smart people figured out how to isolate some of those variables. Given some assumptions (such as that horses running for the same purse value anywhere in the country are of roughly the same ability, and that horses running for bigger purses are better than those running for smaller purses), they could extrapolate a speed figure from the final times. Now, a useless raw time could be turned into a number that better represented how well the horse actually ran. These original speed figures explained some things but purposefully ignored others, such as how fast the horse ran early in the race, or how heavy the jockey was.
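
For readers who want to see the shape of the idea, here is a deliberately simplified sketch of a speed figure. It is not any published figure-maker's actual method. The par times, the ten-points-per-second scale, the base of 100, and the function names are all assumptions of mine for illustration, and the times are invented.

```python
def track_variant(winning_times, par_times):
    """Average seconds by which a day's races ran slower (+) or faster (-) than par.

    The par time is the typical winning time for that class of horse at that
    distance (the assumption that same-purse horses are of similar ability).
    """
    diffs = [time - par for time, par in zip(winning_times, par_times)]
    return sum(diffs) / len(diffs)

def speed_figure(raw_time, par_time, variant, points_per_second=10.0, base=100.0):
    """Convert a raw final time into a figure comparable across days and tracks."""
    adjusted_time = raw_time - variant  # remove the effect of a fast or slow surface
    return base + (par_time - adjusted_time) * points_per_second

# Hypothetical times in seconds, for illustration only.
slow_day_variant = track_variant([71.0, 72.0], [70.0, 71.0])  # track ran 1 second slow
fast_day_variant = track_variant([69.0, 70.0], [70.0, 71.0])  # track ran 1 second fast

# Two horses in races with the same 70.0 par, on different days.
slow_day_horse = speed_figure(71.0, 70.0, slow_day_variant)
fast_day_horse = speed_figure(70.0, 70.0, fast_day_variant)

print(slow_day_horse, fast_day_horse)
```

In this toy example the horse with the slower raw time earns the higher figure, because it ran on a surface that was slowing everyone down. That is the whole trick of phase two: adjust for one or two things you can actually measure, and leave the rest alone.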

The natural next step was to add all these other factors in to create one all-encompassing number that explains everything. So some people did this. By and large, the resulting numbers are awful (and not only because if everyone is betting off the same number, there'd never be any disagreement on a race and no one would win money). Better to isolate one thing really well (raw speed, adjusted for a factor or two) than create an omnistat that tries to explain everything and fails.

Still, the temptation to create these omnistats is strong. We can't capture every variable, and pace Chesterton, not everything worth doing is worth doing poorly. Rather than try and fail, we should just isolate the variables about which we can be sure. With lots of small, nice stats that explain something useful, we can put pieces into place in our mind, see the whole picture, and come to a subjective conclusion.

The process will never be scientific, but it will, at the very minimum, eliminate the temptation to fiddle with data to make it more predictive at the expense of reflecting what actually happened. The most useful stats are those that accurately capture the essence of what happened. We argue from first principles which stats do this better (we explain, for example, why on-base percentage reflects reality better than batting average). We then have one useful clue as to quality: on-base percentage. We accumulate clues in all areas of the sport, and then we can put together the clues to form beliefs about the whole.
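
As a small worked example of that first-principles argument, the sketch below computes both stats using their standard definitions. The two batting lines are invented for illustration and do not belong to any real player; the helper names are mine.

```python
def batting_average(hits, at_bats):
    """Hits per at-bat. Walks are invisible to this stat."""
    return hits / at_bats

def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base per plate appearance (standard OBP definition)."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sac_flies
    return reached / chances

# Hypothetical season lines, for illustration only.
free_swinger = {"hits": 150, "walks": 20, "hit_by_pitch": 0, "at_bats": 500, "sac_flies": 0}
patient_hitter = {"hits": 140, "walks": 90, "hit_by_pitch": 0, "at_bats": 500, "sac_flies": 0}

swinger_avg = batting_average(free_swinger["hits"], free_swinger["at_bats"])
patient_avg = batting_average(patient_hitter["hits"], patient_hitter["at_bats"])
swinger_obp = on_base_percentage(**free_swinger)
patient_obp = on_base_percentage(**patient_hitter)

print(f"free swinger:   AVG {swinger_avg:.3f}  OBP {swinger_obp:.3f}")
print(f"patient hitter: AVG {patient_avg:.3f}  OBP {patient_obp:.3f}")
```

Batting average ranks the free swinger higher, while on-base percentage ranks the patient hitter higher, because a walk avoids an out just as surely as a single does. The stat that counts every way of not making an out is the one that better describes what happened.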

There's one more danger to these omnistats: once useful information gets baked into the cake, it is impossible to figure out what is useful and what is not when the stat provider takes a proprietary stance towards the stats. This is my problem with strength of schedule adjustments made to statistics: they are useful, yes, but also useful is having the raw data. Sometimes, the result of a SoS-adjusted stat is greater accuracy; sometimes, it is not. But an open source attitude towards statistics allows many people to examine where and why your assumptions may lead you astray. Omnistats provided without context cannot allow for this.

I write all this by way of introduction for the upcoming football season. I am a believer in many advanced stats, and I think many old-fashioned stats are worse than useless. For example, I think points scored per drive is incredibly useful, while total yardage is significantly less useful, and time of possession is almost useless. As we approach the season, I will introduce some numbers I think are important. Sometimes, I will make adjustments to those numbers. But I will also provide raw, unadjusted numbers for context wherever possible (that is, wherever I'm the person creating the number). I will link this post quite often while doing so, as a disclaimer for my use of those numbers. Hopefully, eventually, we will separate some wheat from the chaff.