Thursday, February 21, 2013

Analysis of Big Data by Twitter and Other Platforms


Hadoop and Twitter:

When I first heard the term “Hadoop,” it was completely foreign to me. I had no idea what it meant or how it was used. Little did I know that something most of us are very interested in, or at least familiar with, uses Hadoop. Twitter is something that all of us have at least heard of, even if we don’t use it. Twitter uses Hadoop because it can store and process very large data sets. Hadoop combines a distributed file system with a map-reduce implementation. The distributed file system makes hundreds of computers behave like one huge drive, and it replicates information across machines so that data isn’t lost when one of them fails. Map-reduce is a way to break the analysis of data into chunks that can be processed in parallel and then combined. Twitter takes in so much information every day (approximately 400 million tweets daily) that it would be impossible to run without this kind of system.
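
To make the map-reduce idea a little more concrete, here is a minimal sketch in plain Python that counts words across a handful of made-up tweets. This is not Hadoop's API or Twitter's actual code; the function names (map_chunk, shuffle, reduce_counts, word_count) and the sample tweets are my own assumptions, and a thread pool on one machine stands in for the many computers a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(tweets):
    """Map step: emit a (word, 1) pair for every word in one chunk of tweets."""
    return [(word.lower(), 1) for tweet in tweets for word in tweet.split()]


def shuffle(mapped_chunks):
    """Group the emitted pairs by word so each word can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: add up the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(tweets, chunk_size=2):
    """Split the tweets into chunks, map the chunks in parallel, then reduce."""
    chunks = [tweets[i:i + chunk_size] for i in range(0, len(tweets), chunk_size)]
    # Hypothetical stand-in for a cluster: each chunk is mapped by a worker thread.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


# Made-up sample data, purely for illustration.
sample_tweets = [
    "hadoop stores big data",
    "twitter uses hadoop",
    "big data is big",
]
counts = word_count(sample_tweets)
print(counts["big"], counts["hadoop"])
```

The point of the split is that each chunk can be mapped without knowing anything about the other chunks, which is what lets the work spread across many machines.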



Topsy Analytical Tool:


Although Twitter has so much information, the website itself isn’t necessarily known for its high-level analytics. Numerous other websites exist to present Twitter information in ways that Twitter itself doesn’t offer its users. I have looked into several of these tools and read one article called “New Topsy tool can spot Twitter trends before they blow up” (the link is posted under Sources). Topsy is a tool that allows users to monitor and search tweets for certain words. It can be set up so that a user is notified once a specified number of tweets are about a chosen topic. The tool is often used by people who are selling products: once the threshold number of tweets is met, a promotion is triggered. This kind of tweet tracking is also useful for following political polls and similar trends. Topsy is far more advanced than Twitter at finding the geographical location from which tweets originate. In fact, Topsy is able to provide this location 15-25 times more often than Twitter can, because it uses a number of different signals besides geotagging to tie a location to a tweet. While Twitter gives users the option to geotag tweets, only approximately 3% of all tweets are geotagged.
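
The threshold idea described above can be sketched in a few lines of Python. To be clear, this is not Topsy's API or how Topsy actually works; the function name topic_alert, its parameters, and the sample tweets are all assumptions I made up for illustration, and matching a keyword as a plain substring is the simplest possible stand-in for real topic detection.

```python
def topic_alert(tweets, keyword, threshold):
    """Count tweets that mention the keyword and report whether the threshold is met.

    Returns a (triggered, matches) pair. Matching is a case-insensitive
    substring check, which is an assumption made for this sketch.
    """
    keyword = keyword.lower()
    matches = sum(1 for tweet in tweets if keyword in tweet.lower())
    return matches >= threshold, matches


# Made-up sample data, purely for illustration.
incoming = [
    "Loving the new espresso at the corner cafe",
    "Espresso or drip? Tough call this morning",
    "Traffic is terrible today",
    "Need an ESPRESSO before this meeting",
]
triggered, matches = topic_alert(incoming, "espresso", threshold=3)
if triggered:
    print(f"Threshold met with {matches} tweets: start the promotion")
```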
            



Twitter Dashboard for Disaster Response:


Topsy and other platforms are not only useful for promotional activities and politics; they can also be used in more serious situations. When a natural disaster hits and people need help, they often turn to social media to share their cries for help. There were more than 20 million tweets posted during Hurricane Sandy alone. Patrick Meier and his team are working on the Twitter Dashboard for Disaster Response. He describes two ways to handle such large amounts of data during a crisis. One is Advanced Computing, which uses machine-learning algorithms to tag tweets based on their content. Classifiers vary depending on the type of crisis or disaster, so the idea is to take datasets from these different disasters and use them to develop classifiers ahead of time. The dashboard that Meier plans to create will also be able to create classifiers in real time using Human Computation, the other method of data management. As tweets come in, they can be tagged with a certain classifier within a hashtag. Once the new classifier is running, incoming tweets are scanned for that classifier’s “requirements.” Meier’s idea strikes me as genius. If he and his team can get these methods to come together, along with better identification of the geographical location of tweets, the opportunity to help people in need will be so much greater.
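
Here is a toy sketch of how hand-tagged tweets could be turned into a classifier that tags new tweets as they come in. This is not Meier's dashboard or his team's algorithm. I have swapped in a very simple word-overlap scorer in place of a real machine-learning model, and the tag names, function names (train, classify), and sample tweets are all made up for illustration.

```python
from collections import Counter


def train(labeled_tweets):
    """Count how often each word appears under each human-assigned tag."""
    word_counts = {}
    for text, tag in labeled_tweets:
        word_counts.setdefault(tag, Counter()).update(text.lower().split())
    return word_counts


def classify(text, word_counts):
    """Return the tag whose training words overlap most with the tweet.

    Returns None when the tweet shares no words with any tag.
    """
    if not word_counts:
        return None
    words = text.lower().split()
    scores = {tag: sum(counts[w] for w in words) for tag, counts in word_counts.items()}
    best_tag = max(scores, key=scores.get)
    return best_tag if scores[best_tag] > 0 else None


# Made-up examples standing in for tweets that volunteers tagged by hand.
labeled = [
    ("need water and food in hoboken", "request_for_help"),
    ("trapped on roof need rescue", "request_for_help"),
    ("power is out on our street", "infrastructure_damage"),
    ("bridge closed due to flooding", "infrastructure_damage"),
]
model = train(labeled)
print(classify("family trapped and need food", model))
```

A real system would need far more tagged examples and a proper model, but the flow is the same: people tag a few tweets, the classifier learns from those tags, and new tweets get tagged automatically.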




Stuff to Think About:

While we may think (at least I sometimes do) that all of the software and computer science behind Big Data sounds like mumbo jumbo, mostly because we're not experts, it is cool to step back and realize all of the good that it can do. Being able to analyze large sets of data can help save lives, among other cool things. There is so much data out there about anything that each of us is interested in, and so much of it is found in the social media world. We have all of this data at our fingertips thanks to distributed file systems, Hadoop, and other platforms!



PS. According to Google, Topsy is recruiting Software Engineers for Core Hadoop Platform work if anyone wants to live in San Francisco! (Twitter isn’t the only one using Hadoop!)


PPS. For some reason all of the paragraphs will not post in the same font. Technology is great, but definitely isn't perfect!



Sources:





            

2 comments:

  1. I strongly believe social media outlets, such as Twitter, are moving little by little toward becoming the "holy grail" of information. As these sites become more popular and easier to access through smartphone/tablet technology, more and more information about people and their interests becomes available. Many users of these sites tend to put information about themselves, their interests, and their opinions out there for everyone to see. The best part about this information is...IT'S FREE! Anything a company or agency might want to know about people's feelings about issues or products or whatever is out there. For example, with appropriate hashtags and participation, Twitter could be used to poll a VAST population (on whatever the issue may be) in a matter of minutes! I think these outlets will eventually be the only place to look to obtain information. The "nitty-gritty" part will be developing software algorithms that can sift through this incredible amount of data and retrieve relevant, accurate results.

  2. Brianna, great post. I also agree with your statements in the "Stuff to Think About" section. I believe one relevant article in the area of humanitarian logistics can be found at http://koenigstuhl.geog.uni-heidelberg.de/publications/2010/Neis/un-osm-emergency-routing.gi-forum2010.full.pdf

    Carter, yes, I agree with you totally here. There is a new term, "people as sensors," which describes this phenomenon very well. See http://www.mdpi.com/1424-8220/8/5/3037
