Tuesday, April 23, 2013

The Future of Data Mining - "Fast Data"





First, here are some statistics from the article I read for this blog post:
  • Every minute:
    • 48 HOURS of video are uploaded to YouTube
    • 204 million e-mails are sent
    • 600 new websites pop up
    • 600,000 pieces of content are shared on Facebook
    • Upwards of 100,000 tweets are sent

This article stresses the idea that in data mining, time is the key factor. Author Alissa Lorentz states that we must be able to mine data as quickly as we produce it. Because of the plethora of electronic information available today, data mining is extremely important, and this need for speed is an issue of which I was previously not aware. Lorentz discusses the difference between smart data, which provides insight into large data sets, and big data, the term we apply to the extremely large data sets themselves. She then elaborates on a concept she calls "fast data": analyzing data sets in real time. Fast data will eventually be extremely useful. If one were able to analyze all of the data available on a specific company on any given day in a meaningful way, let's just say I'd be looking at the stock market.

In class, we have mainly discussed archiving data, that is, organizing data in a historical sense. This article discusses a different concept: streaming data, i.e. analyzing data live as it arrives rather than storing it for future use. To me, this is ideal. For example, rather than storing every message on Facebook, the site could provide users with a list of the friends they have recently been in contact with. This would save memory and computing power, and it would be more useful to a user than an inbox full of conversations from years ago. Applying this concept to other situations, Lorentz talks about how streaming data could provide important information on traffic or on public health issues such as flu outbreaks. With the abundance of information that is constantly being added to the web, storing and archiving this information will undoubtedly become obsolete. After reading this article, I think the best direction for the data mining world would be to chase the data rather than store it, instead of focusing on analyzing past data. Updating data sets in real time would not only eliminate the need for large storage systems, but it would also better indicate the trends occurring in the here and now.
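
To make the "chase the data" idea a little more concrete, here is a minimal sketch in Python. It is my own illustration, not something from Lorentz's article. It assumes a stream of (timestamp, topic) events and keeps counts only for the most recent window of time, discarding older events instead of archiving them. The class name `SlidingWindowCounter` and the example topics are hypothetical.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts topics seen in the last `window_seconds`; older events are dropped.

    Hypothetical illustration of streaming analysis, not a production design.
    Assumes events arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic) pairs, oldest first
        self.counts = Counter()  # topic -> count within the current window

    def add(self, timestamp, topic):
        """Record a new event, then forget anything that fell out of the window."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events whose timestamp is at or before the window's start.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n=1):
        """Return the n most frequent topics in the current window."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    window = SlidingWindowCounter(window_seconds=60)
    for ts, topic in [(0, "flu"), (10, "flu"), (30, "traffic"), (65, "traffic")]:
        window.add(ts, topic)
    print(window.top(1))
```

The memory used depends only on how many events fall inside the window, not on the total history, which is the trade-off the streaming approach makes: the trend right now is cheap to get, but anything older than the window is gone.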

Link to article:
http://www.wired.com/insights/2013/04/big-data-fast-data-smart-data/

1 comment:

  1. This was a very interesting read, though I think it might be a bit idealistic. The idea that streaming and online data mining has reached, or will soon reach, the ability to run the kind of complex modeling that can be run on offline, archived data seems to me to be a bit of a stretch. The sheer amount of computing power required to run complex algorithms on data creates such a high barrier to entry that it could only realistically be used by the very top of the technology industry. Even Netflix recognizes the limits of using only streaming and online data analysis, as I mentioned in one of my earlier blog posts. Amazon and Google may have the ability to perform real-time big data analysis, but for the rest of the world, the computing power required would be price-prohibitive.
    It may be argued that the cost will surely go down over time, but the complexity, and thus the computational requirements, will likely also continue to rise. So while it may become financially possible for smaller companies to do the kind of analysis that is on the bleeding edge right now, the larger and richer companies will always be able to do better and faster analytics. A much better solution than attempting to take all of your data mining online would be to use online data mining to supplement offline analytics.
