"Big Data - A Revolution That Will Transform How We Live, Work, and Think" is a thought provoking take on the increasing use of large pools of accumulated data for predictive analytics.
The authors are Viktor Mayer-Schönberger, professor of Internet governance and regulation at Oxford University, and Kenneth Cukier, data editor for The Economist.
As such, regular readers of The Economist will recognize some of the early examples cited in the book, such as Farecast's business model of predicting airfare price changes and Google search queries predicting the spread of flu. Do not let that familiarity put you off picking up the book.
The authors do a good job of first defining Big Data as the "ability ...to harness information in novel ways to produce useful insights or goods and services of significant value." It is not just about having large databases; those already existed.
Big Data, the authors write, is about doing things at a large scale that cannot be done at a small scale. Specifically, they contrast the practices of Big Data with statistical sampling. What happens when N is no longer just a statistically relevant sample size, but instead N = all?
That shift is what enables things I am familiar with, like the probabilistic location tracking of WiFi devices pioneered at Newbury Networks, or the machine translation of human language done at Language Weaver, now part of SDL. Tough problems become easier when you consume all the available data and then make a best guess at what a new sample represents.
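To make that "consume all the data, then best-guess" pattern concrete, here is a toy sketch of my own (not Newbury Networks' actual method): store a signal-strength fingerprint for every known location, then place a new device at whichever stored fingerprint is nearest. The rooms, access points, and dBm values are all invented.

```python
# Toy WiFi fingerprint lookup: keep readings for every known spot, then
# guess a new device's location by finding the closest stored reading.

# Hypothetical training data: access-point signal strengths (dBm) per room.
fingerprints = {
    "lobby":      {"ap1": -40, "ap2": -70, "ap3": -80},
    "conference": {"ap1": -65, "ap2": -45, "ap3": -75},
    "kitchen":    {"ap1": -80, "ap2": -72, "ap3": -42},
}

def locate(reading):
    """Return the room whose stored fingerprint is nearest to this reading."""
    def distance(fp):
        # Squared distance; -100 dBm stands in for an access point not heard.
        return sum((reading.get(ap, -100) - rssi) ** 2 for ap, rssi in fp.items())
    return min(fingerprints, key=lambda room: distance(fingerprints[room]))

print(locate({"ap1": -42, "ap2": -68, "ap3": -79}))  # -> "lobby"
```

The more fingerprints you have collected, the better the nearest match tends to be, which is exactly the book's point about scale.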
It also means that correlation trumps causation. When dealing with small samples, one must wonder whether a correlation is just a fluke. But once N = all, does that even matter?
The authors argue that for making quick, intelligent decisions about how to apply resources, the answer is no. New York City teams used Big Data to find which buildings receiving certain types of complaints were more likely to present a fire hazard, and used past data to show which manhole covers were most in danger of exploding. The results were a huge return on the efforts of fire inspectors and repair crews. The "why" does not factor in: they found a model, and it worked.
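A minimal sketch of the idea, again my own and not from the book: when you hold the full population of records rather than a sample, a correlation is simply a property of the data, and ranking by it is enough to direct inspectors. The complaint and fire counts below are made up purely for illustration.

```python
# Pearson correlation computed over every record we have ("N = all"),
# rather than estimated from a sample.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient across all paired records."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-building records: complaint counts vs. fire incidents.
complaints = [0, 1, 1, 2, 3, 5, 8, 9]
fires      = [0, 0, 1, 1, 2, 3, 5, 6]

print(f"correlation = {pearson(complaints, fires):.3f}")
```

A strong number here says nothing about why complaints and fires travel together, and for scheduling inspections it does not need to.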
The authors also use a term that I love: "data exhaust." This refers to incidental information produced alongside a transaction that is secondary to the transaction itself, but still useful. Not just that you clicked a banner ad, but where on the ad you clicked. Not just that you purchased an item, but at what time, and what other purchases were adjacent in time. Disney, for example, is poised to make use of this.
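As a rough sketch of what capturing data exhaust might look like in code (the event names and fields here are invented, not anything the book or Disney describes):

```python
# Logging "data exhaust": record not just the primary event (the click)
# but the incidental detail around it. All field names are hypothetical.

import json
import time

def log_banner_click(user_id, ad_id, x, y):
    """Log the click plus its exhaust: coordinates and a timestamp."""
    event = {
        "event": "banner_click",
        "user": user_id,
        "ad": ad_id,
        "click_x": x,        # exhaust: where on the banner the click landed
        "click_y": y,
        "ts": time.time(),   # exhaust: when, enabling adjacency-in-time analysis
    }
    print(json.dumps(event))  # stand-in for a real event pipeline

log_banner_click("u123", "ad42", x=317, y=58)
```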
There is also an exploration of the potential risks of using Big Data to make decisions that relate directly to individuals. The authors warn about prejudging people based on Big Data (for instance, denying parole based on a correlation indicating a high chance of recidivism).
They also advocate that firms create a role of "algorithmist" to review proposed uses of data and avoid actions or programs harmful to individuals.
The book does a fine job exploring what Big Data allows, who stands to benefit from it, and how society can mitigate any possible harm. Well worth a read for a topic that is only going to grow in importance.
