Wednesday, April 3, 2013

How Big Data can be misleading.

Big Data has so much potential to shed light on new concepts, ideas, and insights that many people have flocked to it as a sort of catch-all.  But Big Data can have its problems as well, mostly when the people studying it forget that correlation does not always equal causation.  For example, a study of Hurricane Sandy-related Twitter and Foursquare data (research paper) produced some expected findings, such as packed grocery stores the night before the storm hit.  This collection of data does not fully represent what occurred over that period, though.  The majority of the Twitter data came from Manhattan, a densely populated area with high smartphone ownership.  This would make one think that Manhattan was the center of the area most affected by the storm, but that is not true.  As the flood water caused extended power outages, people's smartphone batteries died, leaving them unable to tweet.  This is what happened in some of the harder-hit areas, like Coney Island.  This is referred to as a "Signal Problem," where no signal comes from certain areas or communities due to factors like these.
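To make that bias concrete, here is a toy simulation in Python.  All of the numbers (damage levels, smartphone ownership rates, power reliability) are invented for illustration and are not taken from the study; the point is only that the hardest-hit area can end up producing the fewest tweets.

import random

random.seed(42)

# Hypothetical areas: (name, storm damage 0-1, smartphone ownership
# rate, probability the power stayed on).  All numbers are made up.
areas = [
    ("Manhattan",    0.3, 0.9, 0.8),
    ("Coney Island", 0.9, 0.5, 0.2),
]

population = 10_000  # simulated residents per area

for name, damage, phones, power in areas:
    tweets = 0
    for _ in range(population):
        owns_phone = random.random() < phones
        has_power = random.random() < power
        wants_to_tweet = random.random() < damage  # harder hit, more to say
        if owns_phone and has_power and wants_to_tweet:
            tweets += 1
    print(f"{name:13s} damage={damage:.1f} observed_tweets={tweets}")

Run it and Manhattan dominates the tweet count even though Coney Island took three times the damage; judged by tweet volume alone, the storm looks like a Manhattan event.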

Another example of this "Signal Problem" comes from an app used by the City of Boston to fix potholes.  The phone app uses accelerometer and GPS data to passively detect potholes around the city.  But, if you think about it, this data only provides part of the picture of the potholes around the city.  This method will not detect potholes in areas with low smartphone ownership, such as lower-income areas and areas with a large elderly population.  As you can see, Big Data can tell us so much about many of the problems we face today, but we have to remember that it is not the entire picture.  We have to consider which areas are being left out of the data and close those gaps.
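As a rough sketch of how this kind of passive detection might work (the post doesn't describe the app's actual algorithm, so the spike-threshold approach below is an assumption): watch the phone's vertical accelerometer for a sudden jolt and record the GPS position when one occurs.

from dataclasses import dataclass

@dataclass
class Sample:
    lat: float      # GPS latitude at the time of the reading
    lon: float      # GPS longitude
    accel_z: float  # vertical acceleration, in g

# Assumed threshold: a jolt well beyond normal road vibration.
SPIKE_THRESHOLD = 0.5  # deviation from the 1 g resting baseline

def detect_potholes(samples):
    """Flag GPS positions where vertical acceleration spikes.

    A real detector would filter noise, account for speed, and
    cluster repeated reports; this shows only the basic idea.
    """
    return [(s.lat, s.lon)
            for s in samples
            if abs(s.accel_z - 1.0) > SPIKE_THRESHOLD]

readings = [
    Sample(42.3601, -71.0589, 1.02),  # smooth road
    Sample(42.3605, -71.0592, 1.85),  # sharp jolt: possible pothole
    Sample(42.3610, -71.0595, 0.98),
]
print(detect_potholes(readings))  # [(42.3605, -71.0592)]

And of course, no matter how good the detector is, it only ever sees the roads that phone-carrying drivers actually travel.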

Sources:

http://sm.rutgers.edu/pubs/Grinberg-SMPatterns-ICWSM2013.pdf

http://blogs.hbr.org/cs/2013/04/the_hidden_biases_in_big_data.html

1 comment:

  1. Actually, most of the time the misleading element in data mining research is not the data but a wrong predictive model, and this kind of misleading study gives data mining a bad connotation. A predictive model is built from past data and allows predicting future situations (the generalization principle). Model selection can be driven, among other things, by predictive performance. Data mining is about making sense of data, whether for prediction (supervised learning) or description (unsupervised learning), so predictive modeling is one part of data mining.
    You create a data mining model by following these general steps (a rough sketch follows the list):
    • Create the underlying mining structure and include the columns of data that might be needed.
    • Select the algorithm that is best suited to the analytical task.
    • Choose the columns from the structure to use in the model, and specify how they should be used—which column contains the outcome you want to predict, which columns are for input only, and so forth.
    • Optionally, set parameters to fine-tune the processing by the algorithm.
    • Populate the model with data by processing the structure and model.
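    To make those steps concrete outside of SQL Server, here is a minimal sketch in Python using scikit-learn rather than the MSDN tooling; the dataset and column choices are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier

    # Steps 1 and 3: the mining structure and the column roles --
    # (income, age, owns_home) are inputs, bought_product is the
    # outcome to predict.  All values are made up.
    rows = [
        (30000, 25, 0, 0),
        (60000, 40, 1, 1),
        (45000, 35, 1, 1),
        (25000, 22, 0, 0),
        (80000, 50, 1, 1),
    ]
    X = [r[:3] for r in rows]  # input columns
    y = [r[3] for r in rows]   # predictable (outcome) column

    # Steps 2 and 4: select an algorithm suited to the task and set
    # parameters to fine-tune it (a small decision tree here).
    model = DecisionTreeClassifier(max_depth=3, random_state=0)

    # Step 5: populate the model by processing it against the data.
    model.fit(X, y)

    # The trained model can then generalize to a new case.
    print(model.predict([(55000, 38, 1)]))  # e.g. [1]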


    http://msdn.microsoft.com/en-us/library/cc645779.aspx
