Industrial engineering students at Auburn University blog about big data. War Eagle!!
Sunday, March 31, 2013
Customer network value
In marketing, each customer has a lifetime value. This value measures a customer's potential purchasing power, that is, the profit that could be obtained from that customer. Now, as social networks develop, customers also have a network value, which is a new component of customer value.
Network value measures one customer's influence on other customers. The most common way to obtain this value is data mining.
The figure below shows a social network connected through iPhones, with different colors representing different iPhone models. The point of the figure is that more and more people are being connected into social networks.
So, using social networks to sell products or do market surveys could be a new and effective way of marketing. The first step is to find the customers with the highest network value, and the method is data mining.
To perform data mining, data collection is necessary. Companies usually have their own social network pages. For example, Fractal Design, a computer case manufacturer, has a Facebook page and a Twitter account. Customers can connect to these pages as fans or followers, and the customer information can then be fetched by the company. The company can then look for customers who:
Have many connections,
Always talk about their products.
Customers with these features are most likely to be opinion leaders among other customers, and companies can focus on them more. For instance, if they find a customer who regularly reviews their computer cases on his Facebook page and uses them to build systems, they could send him some new products for free, and he then becomes a node of advertisement on the social network, as sketched below.
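As a rough illustration of the "find the most connected customers" idea (not any particular company's method), here is a minimal Python sketch that ranks people in a small, invented follower network by degree centrality using the networkx library; the names and edges are made up for the example.

# Hedged sketch: rank customers in a made-up social graph by how connected they are.
import networkx as nx

# Invented example data: an edge means "follows / interacts with".
edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "Dave"),
         ("Bob", "Carol"), ("Eve", "Alice"), ("Frank", "Alice")]

G = nx.Graph(edges)

# Degree centrality is a simple proxy for "network value":
# customers connected to many others score higher.
centrality = nx.degree_centrality(G)
for customer, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{customer}: {score:.2f}")

In a real setting the edges would come from follower or interaction data pulled from the company's Facebook or Twitter pages, and richer measures (betweenness, influence of shared posts) could replace plain degree.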
Ref: http://predictive-marketing.com/index.php/tag/social-network-analysis/
Improving soldiers' performance using Big Data
If information about soldiers deployed on the battlefield could be acquired easily, how much would operations improve? Equivital, a UK based company, has developed a wearable computer, called Black Ghost, that senses critical information about soldiers, such as health status and location, and relays it to headquarters. By monitoring heart rate, respiration, or GPS data, a commander can tell whether a soldier's performance is deteriorating over a certain period or whether he or she has crossed a boundary.
The LifeMonitor, together with Black Ghost, provides data management and visualization. The big driver of this system is the ability to gather and centralize performance data from multiple soldiers over time. This allows a better understanding of soldier and squad performance over time and of how to improve it through optimized methods. It also helps soldiers quickly identify areas in the field that could leave them vulnerable to attack.
Reference: http://www.wired.co.uk/news/archive/2013-01/21/equivital-black-ghost
Crunching Big Data with Google Big Query
Ryan Boyd, a developer advocate at Google who focuses on Google BigQuery, presents the first part of this video; in his five years at Google, he helped build the Google Apps ISV ecosystem. Tomer Shiran, director of product management at MapR and a founding member of Apache Drill, presents the second part.
Developers have to deal with many different kinds of data in very large volumes. Without good analysis software and methods, they spend a lot of time collecting huge amounts of data and then throwing away data that seems to have no value, even though most of the time that data has potential value of its own. Google understands big data well: every minute, countless users are using Google products such as YouTube, Google Search, Google+ and Gmail. With these huge amounts of data, Google has begun to offer APIs and other technologies that let developers focus on their own fields.
Google BigQuery is built on Dremel, a Google-internal technology for big data analysis. As Wikipedia describes its open-source counterpart: "Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Currently, Drill is incubating at Apache." Apache Drill lets users query terabytes of data in seconds, supports the Protocol Buffers, Avro and JSON data formats, and can use Hadoop and HBase as data sources.
MapR Technologies provides an open, enterprise-grade distribution for Hadoop that is easy, dependable and fast to use, built on open source with standards-based extensions. MapR is deployed at thousands of companies, from small Internet startups to the world's largest enterprises. MapR customers analyze massive amounts of data, including hundreds of billions of events daily, data from ninety percent of the world's Internet population monthly, and data from one trillion dollars in retail purchases annually. MapR has also partnered with Google to provide Hadoop on Google Compute Engine.
The Drill execution engine has two layers: an operator layer, which is serialization-aware and processes individual records, and an execution layer, which is not serialization-aware and processes batches of records, handling communication, dependencies and fault tolerance. MapR provides strong big data processing capabilities and is a leading Hadoop innovator.
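The video itself demonstrates BigQuery through Google's own interfaces; purely as an assumption, here is a minimal sketch of issuing a similar SQL query from Python with the google-cloud-bigquery client library. The project, dataset and table names are placeholders, not real resources.

# Hedged sketch: run a simple aggregation with the google-cloud-bigquery client.
# Assumes credentials are configured and a table `my_project.my_dataset.events` exists.
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # placeholder project id

sql = """
    SELECT country, COUNT(*) AS n_events
    FROM `my_project.my_dataset.events`
    GROUP BY country
    ORDER BY n_events DESC
    LIMIT 10
"""

for row in client.query(sql).result():  # blocks until the query finishes
    print(row["country"], row["n_events"])

The appeal is that this aggregation runs on Google's infrastructure no matter how large the table is, so the developer only writes the query.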
Sources:
http://en.wikipedia.org/wiki/Apache_Drill
At the Intersection of Biology and Technology
As big data increases in importance, companies have started to explore new ways to use it. Smart companies are gathering massive amounts of data and correlating it with other sources to produce new insights. This is where big data and big data analytics come in. Big data is growing into a catalyst for change on a global scale, with seemingly limitless possibilities.
The convergence of biotechnology and bioinformatics is giving companies a great advantage in how they gather and analyze data, as well as in what they can learn from that data.
MC10 and Proteus are companies that use "wearable" technologies and digestible microchips to gather and analyze information about processes like brain activity and hydration levels, which they intend to use for noble causes like lowering costs and increasing levels of care. Sano Intelligence also plans to use wearable devices to "capture and transmit" blood chemistry information from the human body continuously to an analysis platform.
Although the debate continues on how individual physiological data can be legally and ethically used, smart people are applying new technology to reveal the information underlying this massive amount of data.
http://bits.blogs.nytimes.com/2012/09/07/big-data-in-your-blood/
Hadoop is Old News
Even though Hadoop may be all the rage right now and
expected to be the centerpiece of a billion dollar section of the software
industry within the next few years, the tech that Hadoop is founded on has
already been replaced within Google. Hadoop is the open source software based
on two Google research papers that discuss two pieces of closed source Google
software, MapReduce and the Google File System. These papers were published
almost 10 years ago, an eternity in the fast paced technology market, and
Google began to phase out usage of those two pieces of software with new tech
in 2009. Since then Google has used research papers to detail some of their
newer tech. For instance, Google has detailed the platform that creates the
index for Google Search, Caffeine, as well as Pregel, a graph based database
that is used to map complex relationships for the vast amount of information
that Google stores. Dremel, however, appears to be the most intriguing piece of
technology that Google has detailed.
Dremel essentially does what many third parties are trying to do with Hadoop: it allows SQL-like queries over massive amounts of data spread across thousands of servers, very rapidly. Google goes so far as to claim that you can run queries on petabytes of data in a matter of seconds, as opposed to the minutes or even hours it would take Hadoop to accomplish the same thing. According to Google, Dremel can run the kind of query that would take numerous MapReduce jobs in a fraction of the execution time, taking just three seconds to run a query on a petabyte of data. This is an amazing and extremely important accomplishment. With Hadoop you trade speed and responsiveness for the ability to analyze massive amounts of data, but with Dremel there would be no trade-off. In a very similar way to how the open-source community spawned Hadoop after the release of the papers on MapReduce, there is already a team of engineers working on an open-source variant of Dremel aptly named OpenDremel. OpenDremel appears to be a very long way from functionality, though, and it seems less worthwhile since Google now offers BigQuery, a service in which you can use Dremel on your own data.
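For readers who have never written a MapReduce job, here is a toy, in-memory Python sketch of the map, shuffle and reduce steps behind a word count, the canonical MapReduce example. Real Hadoop (or Dremel) work is distributed across many machines; this only illustrates the programming model being discussed.

# Toy sketch of the MapReduce programming model: word count, entirely in memory.
from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}

The point of Dremel/BigQuery is that an analyst can express this kind of aggregation as a single SQL query instead of writing map and reduce functions at all.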
Sources:
https://developers.google.com/bigquery/
Using Data Mining Techniques to Predict the Survival Rate For Heart Transplants
This week I will continue to share my research with my classmates.
As I mentioned, we focus on predicting how many years a specific person can live with a donated heart. In order to solve this problem, the first and main task is to determine the factors (variables) affecting the outcome.
Conventionally, researchers have dealt with small datasets using conventional statistical techniques that do not take collinearity and nonlinearity into account, as was discussed in the previous blog post. They also use some non-parametric, non-statistical techniques that are computationally expensive and require prior knowledge about the data.
The big advantage of today's world is that there is a flood of big data in health informatics that can be handled with data mining techniques, which reveal better and more accurate predictions for the survival of organ transplant recipients than any of the conventional methods used by previous studies.
We started the research by obtaining a very large dataset from UNOS, a tax-exempt, medical, scientific, and educational organization that operates the national Organ Procurement and Transplantation Network. The dataset has 443 variables and 43,000 cases of heart transplant operations. These variables include socio-demographic and health-related factors of both the donors and the recipients, as well as procedure-related factors.
After preprocessing the data (cleaning, dealing with missing values, reorganizing the data for the specific studies, etc.), we used variable selection methods to determine the potential predictive factors. These potential predictive factors are then tested for whether they are actually predictive using data mining algorithms such as Support Vector Machines, Decision Trees and Artificial Neural Networks. After cross-tabulation and sensitivity analysis, we observed that all three methods gave satisfactory results.
For the 3-year survival study, Support Vector Machines gave the best prediction rate, classifying 94.43% of the cases correctly, while the Artificial Neural Network classified 81.18% and the Decision Tree 77.65% of them correctly.
What do these results mean?
For the Support Vector Machine, the accuracy rate is 94.43%, which means that when the model tells us whether a specific person will live or die after receiving the donated organ, it is correct 94.43% of the time and fails to predict correctly 5.57% of the time. It also lets us know which factors play a role in these predictions.
These accuracy rates are higher than those reached with conventional statistical techniques, which is promising for the future success of heart transplants.
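I cannot share the UNOS data here, so the sketch below only shows the general shape of such a comparison in Python with scikit-learn, using a synthetic dataset in place of the real transplant records; the accuracy numbers it prints are not the ones reported above.

# Hedged sketch: compare SVM, decision tree and neural network classifiers
# with cross-validation on synthetic data standing in for the transplant records.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 1,000 "patients", 20 candidate predictors, binary 3-year survival.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Neural network": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")

On the real dataset the same loop would simply take the preprocessed UNOS variables and the 3-year survival label in place of X and y.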
The Mental Approach to Baseball Hitting: Big Data to Analyze Hitters' Brain Function
Considered one of the most difficult tasks in sports,
hitting a thrown baseball, especially at the professional level is something
only the most gifted athletes on the planet can do. The issue is that
performing this task requires a complex interaction between the brain and
muscles in the body. Even the most physically gifted athletes are unable to hit
a baseball if their mental prowess is subpar. According to this article, http://baseballanalysts.com/archives/2009/09/unraveling_the.php,
a professional-level batter has approximately 50 msec (.05 sec) to react after
a pitch is thrown in order to hit it. After that .05 sec, the batter is not
able to alter his swing in any way from what he has decided to do. For
comparison, an average human eye blink takes between 300 and 400 msec.
This means the batter must decide whether to swing or not anywhere from 6 to 8
times faster than someone can blink, not an easy task. Making things even more
difficult is that most pitchers throw three or four different pitches, many of
which move in the air. So now, the batter must identify the type of pitch,
decide whether it is a ball or strike and send electrical signals to their muscles
to react in time in order to successfully hit it. No wonder failing 7 times out
of 10 is considered an elite level of hitting performance (a .300 average). The
paper at this link from the 2013 Sloan Sports Conference details research done
on the subject of batter brain function in determining pitches: http://www.sloansportsconference.com/wp-content/uploads/2013/02/A-System-for-Measuring-the-Neural-Correlates-of-Baseball-Pitch-Recognition-and-Its-Potential-Use-in-Scouting-and-Player-Development.pdf
The study was done using three Division 1 college baseball
players. Each player looked at 468 simulated pitches and was asked to identify
the pitch type using a keyboard as soon as the simulated pitch was thrown. An
fMRI and EEG scanner were used to study the subjects’ brain activity while they
were identifying pitches. A linear equation was formulated to try and determine
which independent variables were related to the time it took to recognize a pitch
and whether it was correctly identified or not. The brain scans were used to
evaluate brain activity in different areas as time passed after the pitch was “thrown”.
The studies found that for all pitch types, brain activity peaked around 400
msec and 900 msec. As a reference, 400 msec would be approximately the time the
pitch would cross the plate at a normal pitch speed, while the researchers
speculated the second peak was a type of post-decision thinking about their
choice. This study found that different regions of the brain are active for
different pitch types and different regions are active for incorrect vs.
correct pitch identification within a pitch type group (an example being one
area is active for a correct fastball ID while another is active for incorrect
fastball ID).
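The paper's exact model is not reproduced here; purely as a hedged illustration of the general approach, the sketch below fits an ordinary least-squares model relating invented predictors (pitch speed, movement, a pitch-type indicator) to a simulated recognition time with scikit-learn. All numbers are made up.

# Hedged sketch: relate invented pitch features to a simulated recognition time (ms)
# with ordinary least squares, as a stand-in for the study's linear model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
speed = rng.uniform(75, 100, n)        # mph, invented
movement = rng.uniform(0, 12, n)       # inches of break, invented
is_breaking_ball = (movement > 6).astype(float)

# Simulated response: in this made-up model, slower pitches and bigger break
# take longer to recognize.
recognition_ms = 350 - 1.5 * speed + 8 * movement + 40 * is_breaking_ball + rng.normal(0, 25, n)

X = np.column_stack([speed, movement, is_breaking_ball])
model = LinearRegression().fit(X, recognition_ms)
print("coefficients:", model.coef_, "intercept:", model.intercept_)

The real study would use measured recognition times and brain-activity features as the variables, but the fitting step looks much the same.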
Some applications of this research may lie in the future of baseball scouting. Teams have long coveted players with the physical tools necessary to hit a baseball, but perhaps this research could help scouts identify who has the mental ability to recognize pitches properly, which is key to hitting the ball. The researchers also hypothesize that the information could be used in scouting reports by identifying which pitches a batter is bad at recognizing. If a team knows their hitter struggles to recognize curveballs, they could work with him to try to correct that. Conversely, if a pitcher knows a batter has difficulty recognizing a certain pitch, he could use that pitch more often, or in particular situations, to try to get him out. The amount of data that could be generated by this research is vast and largely untested, but it could have an important impact on how the value of baseball players is determined in the future.
Saturday, March 30, 2013
More on the Motion Chart
I posted a motion chart regarding percent GDP spend on
military and GDP per capita. While I discussed a little of what caught my eye
at first glance on the visualization, there is more that I wanted to briefly
address. I removed a large portion of the countries in order to clean it up a
little. I kept all European and North American nations, Japan, New Zealand,
Australia, and South Korea. Two topics I want to discuss about these nations
are the economic differences between eastern and western European nations, and
the similarities in the economies of Canada, Australia, New Zealand, and South
Korea.
Below is an image of the chart described above. You can see the western European nations (represented by blue), eastern European nations (yellow-green), North American nations (light green), Australia and New Zealand (yellow), and Japan and South Korea (red). As you can tell, with the exception of the United States, Australia, New Zealand, Japan, and Canada not only lie within the western European nations but have almost identical reactions; so much so that, if they were shown in the same color, they would be indistinguishable. This can be attributed to how closely their economies rely on the same variables. Next I am going to try to create a similar chart going back decades, to observe how much more slowly nations on different continents react to economic changes.
When watching the motion chart, the differences between eastern European nations (represented by blue in the image below) and western European nations (represented by red), in terms of GDP per capita, are astonishing. While the existence of a difference is probably no surprise, I would not have expected clusters as well defined as the two seen throughout the sixteen years of data. What causes this difference? Much of it can be attributed to decades of communism in the eastern nations. While they have moved away from this form of economy, the effects are obviously still visible today. It would be expected that over the next few decades these eastern European nations will begin to migrate up the chart and join those of western Europe.
To see the motion chart of 130 nations see my visualization from last week.
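For anyone who wants to build a similar animated chart themselves, here is a hedged Python sketch using plotly express and its bundled Gapminder sample data. The sample data (GDP per capita versus life expectancy) is only a stand-in, since the military-spending dataset used above is not reproduced here.

# Hedged sketch: an animated "motion chart"-style scatter with plotly express,
# using the bundled Gapminder sample data as a stand-in for the military-spend dataset.
import plotly.express as px

df = px.data.gapminder()
fig = px.scatter(
    df,
    x="gdpPercap", y="lifeExp",
    size="pop", color="continent",
    hover_name="country",
    animation_frame="year",      # the slider that makes the chart "move"
    log_x=True, size_max=45,
)
fig.show()

Swapping in a table of year, country, percent-GDP-on-military and GDP-per-capita columns would reproduce the chart discussed in these posts.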
How Google Search fights Spam
In class we discussed how Google's search program works and why it was better than its predecessors, thanks to its ability to find the most relevant web pages for what you were searching for. But, as always, there are people who try to cheat the system. These people are referred to as spammers: people who try to get their unrelated website to come up in searches, usually in order to push some product on the user.
There are three main ways spammers try to beat the search engine.
1. Cloaking- We talked about this in class. This is the practice of putting the searched-for word in the same color as the background, hiding it from the user while still filling the page with so many copies of the word that the search engine reads them and thinks the site is relevant.
2. Keyword Stuffing- This is similar to cloaking. A website plasters many copies of the keyword on the page, usually at the bottom, to convince the search engine that it is relevant to the search.
3. Paid Links- This is when a website pays other websites to link to its page in order to increase its PageRank, which, as we discussed in class, is how Google determines the importance of a webpage based on the "votes" cast by links on other webpages.
Paid links are a little harder to discover, but usually, if a site has been selling links, Google will no longer trust the links coming from that page. A rough sketch of the PageRank "voting" idea is shown below.
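To make the "votes" idea more concrete, here is a toy power-iteration sketch of PageRank in Python on a tiny made-up link graph. Real PageRank has many refinements (and spam defenses) beyond this.

# Toy PageRank sketch: pages "vote" for the pages they link to.
# Tiny made-up link graph; damping factor 0.85 as in the original paper.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # power iteration until ranks settle
    new_rank = {p: (1 - d) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = d * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})  # C collects the most "votes"

Buying links is an attempt to inflate exactly this score, which is why Google discounts links from pages caught selling them.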
Source:
http://www.google.com/competition/howgooglesearchworks.html#section4
smart grid and data mining (2)
Speaking of the smart grid, power engineers and energy companies are pretty excited, not only because it will bring a new technical revolution but also because it involves a huge amount of money: only about $356 million today, but potentially $4.2 billion by 2015, reaching a cumulative $11.3 billion between 2011 and 2015. That is what Pike Research predicts for the global market for smart grid data analytics, that is, software and services that can mine data and provide intelligence for smart grid vendors, utilities and consumers.
As a result, most utilities around the world have to face new problems: how to deal with a flood of smart grid data in the coming years, and how to mine that data to find ways to cut costs, improve customer adoption and better predict future power needs. In a sense, how well the utilities handle these challenges will shape the destiny of the whole smart grid industry.
There is no doubt that applying the smart algorithms and applications of the Internet industry to the smart grid could generate a host of new ways of doing business. On the utility operations side, smart meters and distribution automation systems can be data-mined to optimize the flow of power or to predict when equipment is most likely to fail. On the customer end, behavioral data and market analysis can be applied to entice more people into energy efficiency programs, or to help them choose which energy-efficient appliances to buy.
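As one hedged example of what "mining" meter data might look like in practice (not any particular vendor's method), the short Python sketch below flags hours where a simulated smart-meter reading drifts far from its recent average, using a rolling z-score in pandas. The readings are simulated stand-ins for real interval data.

# Hedged sketch: flag unusual smart-meter readings with a rolling z-score in pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = pd.date_range("2013-03-01", periods=24 * 14, freq="h")
kwh = pd.Series(1.2 + 0.4 * np.sin(np.arange(len(hours)) * 2 * np.pi / 24)
                + rng.normal(0, 0.05, len(hours)), index=hours)
kwh.iloc[200] += 3.0  # inject one suspicious spike

rolling_mean = kwh.rolling(48).mean()
rolling_std = kwh.rolling(48).std()
z = (kwh - rolling_mean) / rolling_std

print(kwh[z.abs() > 4])  # hours that look abnormal relative to the last two days

A utility would run something of this flavor, at far larger scale, to spot failing equipment, theft or metering errors.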
A host of IT giants are already involved in smart grid data analytics, including Accenture, Capgemini, HP, IBM, Microsoft, Oracle, SAIC, SAP and Siemens. Smaller, newer entrants include OPOWER, OSIsoft, Telvent, Ecologic Analytics and eMeter.
Utilities are also seeking help in upgrading customer relationship management to handle the shift from monthly power bills to daily or hourly interactions via smart meters. As for concerns over home energy data security and privacy, Pike predicts that smart grid IT players such as Cisco, IBM, Microsoft and Oracle will play an important role.
reference:
http://gigaom.com/2011/08/15/big-data-meets-the-smart-grid/
Sports fraud detection using data mining techniques
People in the States love sports, and they hate fraudulent activity in sports. It is believed that too much fraud will cost a sport popularity and audiences and eventually destroy it. Nevertheless, for various reasons, fraud in sports has always existed. It is therefore essential to find ways to detect fraudulent activity in sports.
Before introducing sports fraud detection, I'd like to talk a little bit about sports fraud itself. Based on observation, there are mainly three categories of fraudulent activity in sports: poor player performance, a pattern of unusual calls from the referee, and lopsided wagering. The first two are basically attempts to manipulate the betting line, and several cases have been reported by the media. A recent example was in the summer of 2007, when NBA referee Tim Donaghy was investigated and convicted for compromising basketball games to pay off gambling debts. Lopsided wagering can be used as an indicator of a compromised game; this type of wagering could involve betting in excess of what is normally expected or betting heavily against the favorite.
As you might know, Las Vegas Sports Consultants Inc. (LVSC), which sets betting lines for 90% of Las Vegas casinos, is one of the organizations that actively looks for fraudulent sports activity. LVSC statistically analyzes both betting lines and player performance, looking for any unusual activity. Player performance is judged on a letter-grade scale (i.e., A-F) and takes into account variables such as weather, luck and player health. Taken together, games are rated on a 15-point scale of seriousness: a game rated at 4 or 5 points may undergo an in-house review, while 8-9 point games will involve contact with the responsible league. Leagues are similarly eager to use the services of LVSC to maintain game honesty; LVSC counts several NCAA conferences, the NBA, the NFL, and the NHL among its clients.
In practice, Las Vegas Sports Consultants is not the only gambling institution with an interest in honest and fair sporting events; offshore betting operations are starting to fill this role as well. One popular offshore gambling site, Betfair.com, has signed an agreement with the Union of European Football Associations (UEFA) to help monitor games for match-fixing, unusual results, or suspicious activity.
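LVSC's actual models are not public; purely as a hedged illustration of the "lopsided wagering" signal, the Python sketch below flags games where the share of money on the underdog is far above its historical norm, using invented numbers.

# Hedged sketch: flag games with unusually lopsided wagering, using invented data.
# Idea: compare each game's share of money on the underdog to the historical norm.
import statistics

# Invented history: fraction of total money bet on the underdog in past games.
historical_underdog_share = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 0.32, 0.27, 0.34, 0.30]
mean = statistics.mean(historical_underdog_share)
stdev = statistics.stdev(historical_underdog_share)

todays_games = {"Game 1": 0.33, "Game 2": 0.62, "Game 3": 0.29}  # invented shares

for game, share in todays_games.items():
    z = (share - mean) / stdev
    status = "REVIEW" if abs(z) > 3 else "ok"
    print(f"{game}: underdog share {share:.2f}, z-score {z:+.1f} -> {status}")

A flagged game would then go to the kind of in-house review and league contact described above, not straight to an accusation.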
Big Data and Film
After cloud computing changed people's thinking in business, Big Data came along and changed how businesses operate and analyze, increasing efficiency and mobility. Many different industries benefit from Big Data, and the entertainment business has also taken it up. Analysts can use the number of nominations and wins at different award shows, together with information from betting websites, to predict the major award winners at a specific ceremony; music executives analyze data to learn customers' listening habits and musical inclinations, and to decide where, and with what kinds of musicians, to put on shows; and movie studios, distributors and other film organizations can use Big Data to make decisions about promotion methods, release dates and release locations so that their films make a profit.
Now let's look at ten movies that can help us get familiar with Big Data in an entertaining way.
V for Vendetta (2005 - James McTeigue) In Big Data, V stands for more than Vendetta: the "V-Trinity" stands for Velocity, Volume and Variety. Velocity means real-time processing, Volume means extracting useful and sufficient information from a very large amount of data, and Variety means using and relating many different kinds of data to help make the final decisions.
The Fast and the Furious (2001 - Rob Cohen) Keep your eyes on your data and make it "speak out" useful information. You can use Big Data to analyze your business and predict which decision could make a difference and which could bankrupt you; you learn, at high speed, whether a course of action will fail or succeed. With a data-driven culture, you find out quickly whether you are dying or triumphing. Fail fast, or you are going to get furious!
The Gold Rush (1925 - Charlie Chaplin) Data is like a new gold mine, and a large number of companies want to exploit its potential to extract more and more information and make money. But building a data-driven culture inside an organization is very difficult, much like Chaplin's hard trek through Alaska. Organizations that want an easier and happier revolution should know how to avoid the fatal mistakes of the Big Data rush.
Up (2009 - Pete Docter, Bob Peterson) Up is a touching movie, and the cloud scenes are great fun; fittingly, the volume of Big Data is handled with elastic cloud infrastructure. Big Data with the map-reduce paradigm lets these information technology problems be solved on different cloud infrastructures.
The Elephant Man (1980 - David Lynch) "Hadoop" (named after the toy elephant of creator Doug Cutting's son) is the yellow elephant in the Big Data room. Hadoop began as an open-source project inspired by Google's papers, and it is now a cornerstone of the Big Data foundations.
Titanic (1997 - James Cameron) Titanic shows how a decision made without enough analysis of the uncertainty can come to a sad end. Big Data can give you useful information that you could not see in your data before. With Big Data, you can see the "iceberg" under the "water" and gain a new view of your data and your decisions.
Minority Report (2002 - Steven Spielberg) In the pre-crime department, Anderton stops criminals before they kill their victims by predicting the future. Predictive analytics, using data analysis to predict what will happen, is the "killer application" of Big Data.
No Country for Old Men (2007 - E. & J. Coen) Big Data is a brand-new skill, and the old database "men" need to adjust to the new technology. Big Data also means many kinds of data and a large number of sources, from which decision makers can uncover the potential hidden in "big" data.
Big Fish (2003 - Tim Burton) Big Fish is a movie with a connection to Auburn University. Big Data so far is like Ed Bloom's astonishing story told from his deathbed: sometimes you cannot clearly decide what is reality and what is a fairy tale, because the fairy tale may become reality in the future.
Black Swan (2010 - Darren Aronofsky) A Black Swan is not only a role but also a theory telling us that rare events are hard to predict and can have a significant influence on the final results. Big Data can help you see the rare events in your data; by analyzing the data, you may anticipate the "Black Swan" and have a clearer idea of what to do in the future.
Source:
http://bigdata-doctor.com/big-data-explained-in-10-movies/
Tutorial: Naive Bayes Classification Algorithm
Naive Bayes is a simple classification algorithm. It is based on treating the effect of each criterion on the result as a probability: all of the training documents and any newly added documents are used to calculate the probability that each newly added term affects each category (Adsiz, 2006). By Bayes' theorem (Han & Kamber, 2001),
P(c_j | d) = P(c_j) P(d | c_j) / P(d),
where P(c_j) is the prior probability of category c_j and d = (t_1, ..., t_M) is the document, represented by its terms, that is going to be classified.
Because there are many features in the dataset, it would be very difficult to calculate P(d | c_j) directly. Therefore, in this algorithm, the features in a document are treated as independent (Adsiz, 2006):
P(d | c_j) = ∏ P(t_i | c_j), i = 1, ..., M,
where t_i is the i-th term in the document. P(c_j | d), the probability of a document being in category c_j, is then proportional to
P(c_j | d) ∝ P(c_j) ∏ P(t_i | c_j), i = 1, ..., M,
where P(t_i | c_j) is the conditional probability of term t_i occurring in category c_j; it is also an indicator of how much t_i contributes to the category. In this method, we try to find the most appropriate category for the document. For this purpose, the best approach is the maximum a posteriori (MAP) category c_map (Manning, Raghavan & Schütze, 2008):
c_map = argmax_{c_j} P(c_j | d) = argmax_{c_j} P(c_j) ∏ P(t_i | c_j).
We do not know the exact values of the parameters P(c_j) and P(t_i | c_j). However, using maximum likelihood estimation (MLE), we can estimate them. Let P~(c_j) be the estimate of P(c_j), and P~(t_i | c_j) the estimate of P(t_i | c_j). Then
P~(c_j) = N_j / N,
where N is the number of all documents and N_j is the number of documents in category c_j. And
P~(t_i | c_j) = (1 + N_ij) / (M + ∑_i N_ij), i = 1, ..., M,
where N_ij is the number of documents belonging to category c_j that contain the i-th term t_i.
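To tie the formulas together, here is a small from-scratch Python sketch that estimates P~(c_j) and P~(t_i | c_j) from a toy labeled corpus and classifies a new document by the maximum a posteriori rule above. It follows the smoothed estimates given in this tutorial, with logs to avoid underflow; the training sentences are invented.

# Small from-scratch Naive Bayes sketch following the estimates above.
# Toy training data: (document, category) pairs; everything here is invented.
import math
from collections import defaultdict

training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

vocab = {t for doc, _ in training for t in doc.split()}
M = len(vocab)                                   # number of distinct terms
N = len(training)                                # number of training documents

N_j = defaultdict(int)                           # documents per category c_j
N_ij = defaultdict(lambda: defaultdict(int))     # documents in c_j containing term t_i
for doc, c in training:
    N_j[c] += 1
    for t in set(doc.split()):
        N_ij[c][t] += 1

def posterior_score(doc, c):
    # log of P~(c_j) * prod_i P~(t_i | c_j), with the smoothed estimate above
    score = math.log(N_j[c] / N)
    denom = M + sum(N_ij[c].values())
    for t in doc.split():
        score += math.log((1 + N_ij[c][t]) / denom)
    return score

new_doc = "buy cheap pills"
best = max(N_j, key=lambda c: posterior_score(new_doc, c))
print(best)  # classified as "spam" for this toy example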
1. Adsiz, A. (2006). Text Mining (dissertation). Ahmet Yesevi University.
2. Han, J. & Kamber, M. (2001). Data Mining. Morgan Kaufmann Publishers, San Francisco, CA.
3. Manning, C.D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Tutorial: Topsy
There are plenty of tools out there to analyze social media
topics and trends. One that Chris and I have found especially helpful in
analyzing these large amounts of data is called “Topsy”. He is probably better
at understanding all of the cool things it can do, but I will give a short
tutorial on the basic information it can provide.
**To be able to try this tutorial, you will need to create a
trial account. It is free and it lasts for two weeks, so you’ll have plenty of
time to be able to play around with it.
Once you have created your account and have logged in, you will
see this as the main screen:
The first thing you will want to do is type in the terms
that you want to search for in the bar at the top. After each keyword or
phrase, hit enter. In this example, I will search three phrases: “data
analysis”, “big data”, and “big data analytics”. To compare the three terms, I
need to be sure that the check box beside each phrase (under the search bar) is
checked. You will then be looking at your Dashboard. This feature gives an
overview of the information Topsy has collected. The timeline shown is based on
the last seven days, but you can choose a specific date range if you would
like. In this case, my Dashboard looks like this:
On the Dashboard, you are able to see Tweet activity over
time. It is easy to see that the phrase “big data” is a lot more prevalent on
Twitter than the other two phrases that were searched. You are also presented with Top Tweets,
Top Links, and Top Media.
If you click on the Geography tab at the top, you are able
to see where the Tweets are coming from. Topsy is gathering most of its Tweets (at
least about these topics) from Twitter users in the US. There are almost 25,000
tweets from the US, and the next closest is the UK with only about 4,000
tweets. You can see the breakdown of the tweets from around the world below:
You can even click on the United States link, and Topsy
breaks down the tweets by state. Pretty amazing! Alabama doesn’t have much to
say about these topics. There are only 43 tweets with these phrases from
Alabama, whereas California has over 3,000. Big data must be a hot topic in
Silicon Valley!
Now I am going to go back to the Dashboard. Right around March 29, I see a huge peak in the frequency of the term "big data". Let's find out why!
If I click on this peak, it leads me to the Activity tab, where I am able to see a list of the Top Tweets about "big data" for March 29 at 23:00. It looks like this:
I am able to click on any links that may be connected to
those tweets to read more about what was so popular about “big data” on this
day. One of the main tweets that seems to have gotten the most action at this
time was about how Doctors can use big data to improve cancer treatments. Since
a link is attached to this tweet, I am able to check it out myself!
Like I said, Topsy is a great tool for analyzing the huge
amounts of data found in the social media world. These are just a couple of the
things that the program can do. I encourage you to check it out and find out
more things that this tool can be used for!
PS. I focused mainly on Twitter data in this example, but it
is my understanding that you can search Facebook, Tumblr, and Pinterest as
well.
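Topsy itself is a point-and-click tool, but if you export a set of search results to a CSV you can reproduce the frequency-over-time view yourself. The sketch below assumes a file named topsy_export.csv with "created_at" and "text" columns; both the filename and the column names are hypothetical, not Topsy's actual export format.

# Hedged sketch: count tweets per day from a hypothetical CSV export of search results.
import pandas as pd

tweets = pd.read_csv("topsy_export.csv", parse_dates=["created_at"])

# Tweets per day mentioning "big data" (case-insensitive).
mask = tweets["text"].str.contains("big data", case=False, na=False)
per_day = tweets.loc[mask].set_index("created_at").resample("D").size()

print(per_day)          # daily counts, like the Dashboard timeline
per_day.plot()          # requires matplotlib; a quick frequency-over-time chart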