Wednesday, April 3, 2013

The Hidden Biases in Big Data

     With big data hype reaching newfound heights, Kate Crawford, writing in the Harvard Business Review, examines potential faults in the analysis of large data sets.  According to her, "The hype becomes problematic when it leads to what I call 'data fundamentalism,' the notion that correlation always indicates causation, and that massive data sets and predictive analytics always reflect objective truth."

      Hidden biases in the collection and analysis of data present risks that must be accounted for in the big picture of big data.  For example, consider the Twitter data generated during Hurricane Sandy: 20 million tweets from Oct 27 to Nov 1.  Examining the data showed that the majority of tweets came from Manhattan, which could lead one to believe Manhattan was affected the most.  As power outages spread and phone batteries drained, even fewer tweets came from the harder-hit areas, skewing the data even more.  Kate Crawford calls situations like this a "signal problem" within big data.
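     A rough sketch can make the bias concrete.  The numbers below are entirely made up for illustration (they are not from Crawford's article or the Sandy data set): tweet volume scales with who has a working, powered phone, not with how hard an area was hit, so the hardest-hit area produces the fewest tweets.

# Toy illustration of a "signal problem": tweet volume tracks smartphone
# density and power availability, not storm impact.  All numbers invented.
areas = {
    #  name            (impact, phones_per_capita, power_uptime)
    "Manhattan":       (0.3, 0.9, 0.8),
    "Outer borough":   (0.8, 0.5, 0.3),
    "Shore town":      (1.0, 0.4, 0.1),
}

for name, (impact, phones, power) in areas.items():
    # Expected tweets scale with people able to tweet, not with damage.
    expected_tweets = 1000 * phones * power
    print(f"{name:14s} impact={impact:.1f}  expected tweets ~ {expected_tweets:.0f}")

     Run it and Manhattan dominates the tweet counts (~720 vs. ~40 for the hardest-hit shore town), which is exactly the inversion Crawford warns about: the loudest signal comes from the least-affected place.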

     Another example of this "signal problem" is in Boston, where the Street Bump smartphone app was developed to get citizens involved in reporting potholes spotted on city streets.  The same "signal problem" appears here because smartphone ownership is lower among lower-income residents and the elderly.  For Boston, this means the app misses input from a significant part of the population.
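     One crude way to reason about this gap, sketched below with invented report counts and assumed ownership rates (this reweighting is not a method from the article, just a back-of-the-envelope illustration), is to divide raw reports by each neighborhood's smartphone-ownership rate.

# Hypothetical pothole reports per neighborhood, with an assumed
# smartphone-ownership rate.  Dividing by the rate gives a naive
# correction for residents the app never hears from.  Numbers invented.
neighborhoods = {
    #  name            (raw_reports, smartphone_rate)
    "Downtown":        (120, 0.80),
    "Lower-income":    (30, 0.35),
    "Elderly-heavy":   (20, 0.25),
}

for name, (reports, rate) in neighborhoods.items():
    adjusted = reports / rate  # naive reweighting by ownership rate
    print(f"{name:14s} raw={reports:3d}  adjusted ~ {adjusted:.0f}")

     Even this naive adjustment narrows the apparent gap considerably (150 vs. ~86 and 80, instead of 120 vs. 30 and 20), hinting at how much the raw counts understate need in under-connected neighborhoods.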

     So what should big data scientists do to avoid these hidden biases?  In the short term, Kate Crawford suggests they "take a page from social scientists, who have a long history of asking where the data they're working with comes from, what methods were used to gather and analyze it, and what cognitive biases they might bring to its interpretation."  In essence, big data scientists must first ask the question "why?" and not just "how many?"  Only then will the depths of big data be revealed.
     
http://blogs.hbr.org/cs/2013/04/the_hidden_biases_in_big_data.html
