Industrial engineering students at Auburn University blog about big data. War Eagle!!
Sunday, March 31, 2013
Customer network value
In marketing, each customer has a lifetime value. This value measures a customer's potential purchasing power, that is, the profit that could be obtained from that customer. Now, as social networks develop, customers also have a network value, which is a new component of customer value.
Network value measures one customer's influence on other customers. The most common way to obtain this value is data mining.
The figure below shows a social network connected through iPhones, with different colors representing different iPhone models. The point of the figure is that more and more people are being connected into social networks.
So, using social networks to sell products or do market surveys could be a new and effective way of marketing. The first step is to find the customers with the highest network value, and the method is data mining.
To perform data mining, data collection is necessary. Companies usually have their own social network pages. For example, Fractal Design, a computer case manufacturer, has a Facebook page and a Twitter account. Customers can connect to these pages as fans or followers, and the customer information can then be fetched by the company. The company can then look for customers who:
Have many connections,
Always talk about their products.
Customers with these features are most likely to be opinion leaders among other customers, and companies can focus on them more. For instance, if they find a customer who regularly reviews their computer cases on his Facebook page and uses them to build systems, they could send him some new products for free, and he then becomes a node of advertisement on the social network, as sketched below.
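As a rough illustration of the "find the most connected customers" idea (not any particular company's method), here is a minimal Python sketch that ranks people in a small, invented follower network by degree centrality using the networkx library; the names and edges are made up for the example.

# Hedged sketch: rank customers in a made-up social graph by how connected they are.
import networkx as nx

# Invented example data: an edge means "follows / interacts with".
edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "Dave"),
         ("Bob", "Carol"), ("Eve", "Alice"), ("Frank", "Alice")]

G = nx.Graph(edges)

# Degree centrality is a simple proxy for "network value":
# customers connected to many others score higher.
centrality = nx.degree_centrality(G)
for customer, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{customer}: {score:.2f}")

In a real setting the edges would come from follower or interaction data pulled from the company's Facebook or Twitter pages, and richer measures (betweenness, influence of shared posts) could replace plain degree.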
Ref: http://predictive-marketing.com/index.php/tag/social-network-analysis/
Improving soldiers' performance using Big Data
If information about soldiers deployed on the battlefield could be acquired easily, how much would operations improve? Equivital, a UK based company, has developed a wearable computer, called Black Ghost, that senses critical information about soldiers, such as health status and location, and relays it to headquarters. By monitoring heart rate, respiration, or GPS data, a commander can tell whether a soldier's performance is deteriorating over a certain period or whether he or she has crossed a boundary.
The LifeMonitor, together with Black Ghost, provides data management and visualization. The big driver of this system is the ability to gather and centralize performance data from multiple soldiers over time. This allows a better understanding of soldier and squad performance over time and of how to improve it through optimized methods. It also helps soldiers quickly identify areas in the field that could leave them vulnerable to attack.
Reference: http://www.wired.co.uk/news/archive/2013-01/21/equivital-black-ghost
Crunching Big Data with Google Big Query
Ryan Boyd, a developer advocate at Google who focuses on Google BigQuery, presents the first part of this video; in his five years at Google, he helped build the Google Apps ISV ecosystem. Tomer Shiran, director of product management at MapR and a founding member of Apache Drill, presents the second part.
Developers have to deal with many different kinds of data in very large volumes. Without good analysis software and methods, they spend a lot of time collecting huge amounts of data and then throwing away data that seems to have no value, even though most of the time that data has potential value of its own. Google understands big data well: every minute, countless users are using Google products such as YouTube, Google Search, Google+ and Gmail. With these huge amounts of data, Google has begun to offer APIs and other technologies that let developers focus on their own fields.
Google BigQuery is built on Dremel, a Google-internal technology for big data analysis. As Wikipedia describes its open-source counterpart: "Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Currently, Drill is incubating at Apache." Apache Drill lets users query terabytes of data in seconds, supports the Protocol Buffers, Avro and JSON data formats, and can use Hadoop and HBase as data sources.
MapR Technologies provides an open, enterprise-grade distribution for Hadoop that is easy, dependable and fast to use, built on open source with standards-based extensions. MapR is deployed at thousands of companies, from small Internet startups to the world's largest enterprises. MapR customers analyze massive amounts of data, including hundreds of billions of events daily, data from ninety percent of the world's Internet population monthly, and data from one trillion dollars in retail purchases annually. MapR has also partnered with Google to provide Hadoop on Google Compute Engine.
The Drill execution engine has two layers: an operator layer, which is serialization-aware and processes individual records, and an execution layer, which is not serialization-aware and processes batches of records, handling communication, dependencies and fault tolerance. MapR provides strong big data processing capabilities and is a leading Hadoop innovator.
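The video itself demonstrates BigQuery through Google's own interfaces; purely as an assumption, here is a minimal sketch of issuing a similar SQL query from Python with the google-cloud-bigquery client library. The project, dataset and table names are placeholders, not real resources.

# Hedged sketch: run a simple aggregation with the google-cloud-bigquery client.
# Assumes credentials are configured and a table `my_project.my_dataset.events` exists.
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # placeholder project id

sql = """
    SELECT country, COUNT(*) AS n_events
    FROM `my_project.my_dataset.events`
    GROUP BY country
    ORDER BY n_events DESC
    LIMIT 10
"""

for row in client.query(sql).result():  # blocks until the query finishes
    print(row["country"], row["n_events"])

The appeal is that this aggregation runs on Google's infrastructure no matter how large the table is, so the developer only writes the query.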
Sources:
http://en.wikipedia.org/wiki/Apache_Drill
At the Intersection of Biology and Technology
As big data increases in importance, companies have started to explore new ways to use it. Smart companies are gathering massive amounts of data and correlating it with other sources to produce new insights. This is where big data and big data analytics come in. Big data is growing into a catalyst for change on a global scale, with seemingly limitless possibilities.
The convergence of biotechnology and bioinformatics is giving companies a great advantage in how they gather and analyze data, as well as in what they can learn from that data.
MC10 and Proteus are companies that use "wearable" technologies and digestible microchips to gather and analyze information about processes like brain activity and hydration levels, which they intend to use for noble causes like lowering costs and increasing levels of care. Sano Intelligence also plans to use wearable devices to "capture and transmit" blood chemistry information from the human body continuously to an analysis platform.
Although the debate continues on how individual physiological data can be legally and ethically used, smart people are applying new technology to reveal the information underlying this massive amount of data.
http://bits.blogs.nytimes.com/2012/09/07/big-data-in-your-blood/
Hadoop is Old News
Even though Hadoop may be all the rage right now and
expected to be the centerpiece of a billion dollar section of the software
industry within the next few years, the tech that Hadoop is founded on has
already been replaced within Google. Hadoop is the open source software based
on two Google research papers that discuss two pieces of closed source Google
software, MapReduce and the Google File System. These papers were published
almost 10 years ago, an eternity in the fast paced technology market, and
Google began to phase out usage of those two pieces of software with new tech
in 2009. Since then Google has used research papers to detail some of their
newer tech. For instance, Google has detailed the platform that creates the
index for Google Search, Caffeine, as well as Pregel, a graph based database
that is used to map complex relationships for the vast amount of information
that Google stores. Dremel, however, appears to be the most intriguing piece of
technology that Google has detailed.
Dremel essentially does what many third parties are trying to do with Hadoop: it allows SQL-like queries over massive amounts of data spread across thousands of servers, very rapidly. Google goes so far as to claim that you can run queries on petabytes of data in a matter of seconds, as opposed to the minutes or even hours it would take Hadoop to accomplish the same thing. According to Google, Dremel can run the kind of query that would take numerous MapReduce jobs in a fraction of the execution time, taking just three seconds to run a query on a petabyte of data. This is an amazing and extremely important accomplishment. With Hadoop you trade speed and responsiveness for the ability to analyze massive amounts of data, but with Dremel there would be no trade-off. In a very similar way to how the open-source community spawned Hadoop after the release of the papers on MapReduce, there is already a team of engineers working on an open-source variant of Dremel aptly named OpenDremel. OpenDremel appears to be a very long way from functionality, though, and it seems less worthwhile since Google now offers BigQuery, a service in which you can use Dremel on your own data.
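For readers who have never written a MapReduce job, here is a toy, in-memory Python sketch of the map, shuffle and reduce steps behind a word count, the canonical MapReduce example. Real Hadoop (or Dremel) work is distributed across many machines; this only illustrates the programming model being discussed.

# Toy sketch of the MapReduce programming model: word count, entirely in memory.
from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}

The point of Dremel/BigQuery is that an analyst can express this kind of aggregation as a single SQL query instead of writing map and reduce functions at all.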
Sources:
https://developers.google.com/bigquery/
Using Data Mining Techniques to Predict the Survival Rate For Heart Transplants
This week I will continue to share my research with my classmates.
As I mentioned, we focus on predicting how many years a specific person can live with a donated heart. In order to solve this problem, the first and main task is to determine the factors (variables) affecting the outcome.
Conventionally, researchers have dealt with small datasets using conventional statistical techniques that do not take collinearity and nonlinearity into account, as was discussed in the previous blog post. They also use some non-parametric, non-statistical techniques that are computationally expensive and require prior knowledge about the data.
The big advantage of today's world is that there is a flood of big data in health informatics that can be handled with data mining techniques, which reveal better and more accurate predictions for the survival of organ transplant recipients than any of the conventional methods used by previous studies.
We started the research by obtaining a very large dataset from UNOS, a tax-exempt, medical, scientific, and educational organization that operates the national Organ Procurement and Transplantation Network. The dataset has 443 variables and 43,000 cases of heart transplant operations. These variables include socio-demographic and health-related factors of both the donors and the recipients, as well as procedure-related factors.
After preprocessing the data (cleaning, dealing with missing values, reorganizing the data for the specific studies, etc.), we used variable selection methods to determine the potential predictive factors. These potential predictive factors are then tested for whether they are actually predictive using data mining algorithms such as Support Vector Machines, Decision Trees and Artificial Neural Networks. After cross-tabulation and sensitivity analysis, we observed that all three methods gave satisfactory results.
For the 3-year survival study, Support Vector Machines gave the best prediction rate, classifying 94.43% of the cases correctly, while the Artificial Neural Network classified 81.18% and the Decision Tree 77.65% of them correctly.
What do these results mean?
For the Support Vector Machine, the accuracy rate is 94.43%, which means that when the model tells us whether a specific person will live or die after receiving the donated organ, it is correct 94.43% of the time and fails to predict correctly 5.57% of the time. It also lets us know which factors play a role in these predictions.
These accuracy rates are higher than those reached with conventional statistical techniques, which is promising for the future success of heart transplants.
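I cannot share the UNOS data here, so the sketch below only shows the general shape of such a comparison in Python with scikit-learn, using a synthetic dataset in place of the real transplant records; the accuracy numbers it prints are not the ones reported above.

# Hedged sketch: compare SVM, decision tree and neural network classifiers
# with cross-validation on synthetic data standing in for the transplant records.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 1,000 "patients", 20 candidate predictors, binary 3-year survival.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Neural network": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")

On the real dataset the same loop would simply take the preprocessed UNOS variables and the 3-year survival label in place of X and y.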
The Mental Approach to Baseball Hitting: Big Data to Analyze Hitters' Brain Function
Considered one of the most difficult tasks in sports,
hitting a thrown baseball, especially at the professional level is something
only the most gifted athletes on the planet can do. The issue is that
performing this task requires a complex interaction between the brain and
muscles in the body. Even the most physically gifted athletes are unable to hit
a baseball if their mental prowess is subpar. According to this article, http://baseballanalysts.com/archives/2009/09/unraveling_the.php,
a professional-level batter has approximately 50 msec (.05 sec) to react after
a pitch is thrown in order to hit it. After that .05 sec, the batter is not
able to alter his swing in any way from what he has decided to do. For
comparison, an average human eye blink takes between 300 and 400 msec.
This means the batter must decide whether to swing or not anywhere from 6 to 8
times faster than someone can blink, not an easy task. Making things even more
difficult is that most pitchers throw three or four different pitches, many of
which move in the air. So now, the batter must identify the type of pitch,
decide whether it is a ball or strike and send electrical signals to their muscles
to react in time in order to successfully hit it. No wonder failing 7 times out
of 10 is considered an elite level of hitting performance (a .300 average). The
paper at this link from the 2013 Sloan Sports Conference details research done
on the subject of batter brain function in determining pitches: http://www.sloansportsconference.com/wp-content/uploads/2013/02/A-System-for-Measuring-the-Neural-Correlates-of-Baseball-Pitch-Recognition-and-Its-Potential-Use-in-Scouting-and-Player-Development.pdf
The study was done using three Division 1 college baseball
players. Each player looked at 468 simulated pitches and was asked to identify
the pitch type using a keyboard as soon as the simulated pitch was thrown. An
fMRI and EEG scanner were used to study the subjects’ brain activity while they
were identifying pitches. A linear equation was formulated to try and determine
which independent variables were related to the time it took to recognize a pitch
and whether it was correctly identified or not. The brain scans were used to
evaluate brain activity in different areas as time passed after the pitch was “thrown”.
The studies found that for all pitch types, brain activity peaked around 400
msec and 900 msec. As a reference, 400 msec would be approximately the time the
pitch would cross the plate at a normal pitch speed, while the researchers
speculated the second peak was a type of post-decision thinking about their
choice. This study found that different regions of the brain are active for
different pitch types and different regions are active for incorrect vs.
correct pitch identification within a pitch type group (an example being one
area is active for a correct fastball ID while another is active for incorrect
fastball ID).
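The paper's exact model is not reproduced here; purely as a hedged illustration of the general approach, the sketch below fits an ordinary least-squares model relating invented predictors (pitch speed, movement, a pitch-type indicator) to a simulated recognition time with scikit-learn. All numbers are made up.

# Hedged sketch: relate invented pitch features to a simulated recognition time (ms)
# with ordinary least squares, as a stand-in for the study's linear model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
speed = rng.uniform(75, 100, n)        # mph, invented
movement = rng.uniform(0, 12, n)       # inches of break, invented
is_breaking_ball = (movement > 6).astype(float)

# Simulated response: in this made-up model, slower pitches and bigger break
# take longer to recognize.
recognition_ms = 350 - 1.5 * speed + 8 * movement + 40 * is_breaking_ball + rng.normal(0, 25, n)

X = np.column_stack([speed, movement, is_breaking_ball])
model = LinearRegression().fit(X, recognition_ms)
print("coefficients:", model.coef_, "intercept:", model.intercept_)

The real study would use measured recognition times and brain-activity features as the variables, but the fitting step looks much the same.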
Some applications of this research may lie in the future of baseball scouting. Teams have long coveted players with the physical tools necessary to hit a baseball, but perhaps this research could help scouts identify who has the mental ability to recognize pitches properly, which is key to hitting the ball. The researchers also hypothesize that the information could be used in scouting reports by identifying which pitches a batter is bad at recognizing. If a team knows their hitter struggles to recognize curveballs, they could work with him to try to correct that. Conversely, if a pitcher knows a batter has difficulty recognizing a certain pitch, he could use that pitch more often, or in particular situations, to try to get him out. The amount of data that could be generated by this research is vast and largely untested, but it could have an important impact on how the value of baseball players is determined in the future.
Saturday, March 30, 2013
More on the Motion Chart
I posted a motion chart regarding percent GDP spend on
military and GDP per capita. While I discussed a little of what caught my eye
at first glance on the visualization, there is more that I wanted to briefly
address. I removed a large portion of the countries in order to clean it up a
little. I kept all European and North American nations, Japan, New Zealand,
Australia, and South Korea. Two topics I want to discuss about these nations
are the economic differences between eastern and western European nations, and
the similarities in the economies of Canada, Australia, New Zealand, and South
Korea.
Below is an image of the chart described above. You can see the western European nations (represented by blue), eastern European nations (yellow-green), North American nations (light green), Australia and New Zealand (yellow), and Japan and South Korea (red). As you can tell, with the exception of the United States, Australia, New Zealand, Japan, and Canada not only lie within the western European nations but have almost identical reactions; so much so that, if they were shown in the same color, they would be indistinguishable. This can be attributed to how closely their economies rely on the same variables. Next I am going to try to create a similar chart going back decades, to observe how much more slowly nations on different continents react to economic changes.
When watching the motion chart, the differences between eastern European nations (represented by blue in the image below) and western European nations (represented by red), in terms of GDP per capita, are astonishing. While the existence of a difference is probably no surprise, I would not have expected clusters as well defined as the two seen throughout the sixteen years of data. What causes this difference? Much of it can be attributed to decades of communism in the eastern nations. While they have moved away from this form of economy, the effects are obviously still visible today. It would be expected that over the next few decades these eastern European nations will begin to migrate up the chart and join those of western Europe.
To see the motion chart of 130 nations see my visualization from last week.
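For anyone who wants to build a similar animated chart themselves, here is a hedged Python sketch using plotly express and its bundled Gapminder sample data. The sample data (GDP per capita versus life expectancy) is only a stand-in, since the military-spending dataset used above is not reproduced here.

# Hedged sketch: an animated "motion chart"-style scatter with plotly express,
# using the bundled Gapminder sample data as a stand-in for the military-spend dataset.
import plotly.express as px

df = px.data.gapminder()
fig = px.scatter(
    df,
    x="gdpPercap", y="lifeExp",
    size="pop", color="continent",
    hover_name="country",
    animation_frame="year",      # the slider that makes the chart "move"
    log_x=True, size_max=45,
)
fig.show()

Swapping in a table of year, country, percent-GDP-on-military and GDP-per-capita columns would reproduce the chart discussed in these posts.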
How Google Search fights Spam
In class we discussed how Google's search program works and why it was better than its predecessors, thanks to its ability to find the most relevant web pages for what you were searching for. But, as always, there are people who try to cheat the system. These people are referred to as spammers: people who try to get their unrelated website to come up in searches, usually in order to push some product on the user.
There are three main ways spammers try to beat the search engine.
1. Cloaking- We talked about this in class. This is the practice of putting the searched-for word in the same color as the background, hiding it from the user while still filling the page with so many copies of the word that the search engine reads them and thinks the site is relevant.
2. Keyword Stuffing- This is similar to cloaking. A website plasters many copies of the keyword on the page, usually at the bottom, to convince the search engine that it is relevant to the search.
3. Paid Links- This is when a website pays other websites to link to its page in order to increase its PageRank, which, as we discussed in class, is how Google determines the importance of a webpage based on the "votes" cast by links on other webpages.
Paid links are a little harder to discover, but usually, if a site has been selling links, Google will no longer trust the links coming from that page. A rough sketch of the PageRank "voting" idea is shown below.
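To make the "votes" idea more concrete, here is a toy power-iteration sketch of PageRank in Python on a tiny made-up link graph. Real PageRank has many refinements (and spam defenses) beyond this.

# Toy PageRank sketch: pages "vote" for the pages they link to.
# Tiny made-up link graph; damping factor 0.85 as in the original paper.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # power iteration until ranks settle
    new_rank = {p: (1 - d) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = d * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})  # C collects the most "votes"

Buying links is an attempt to inflate exactly this score, which is why Google discounts links from pages caught selling them.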
Source:
http://www.google.com/competition/howgooglesearchworks.html#section4
smart grid and data mining (2)
Speaking of the smart grid, power engineers and energy companies are pretty excited, not only because it will bring a new technical revolution but also because it involves a huge amount of money: only about $356 million today, but potentially $4.2 billion by 2015, reaching a cumulative $11.3 billion between 2011 and 2015. That is what Pike Research predicts for the global market for smart grid data analytics, that is, software and services that can mine data and provide intelligence for smart grid vendors, utilities and consumers.
As a result, most utilities around the world have to face new problems: how to deal with a flood of smart grid data in the coming years, and how to mine that data to find ways to cut costs, improve customer adoption and better predict future power needs. In a sense, how well the utilities handle these challenges will shape the destiny of the whole smart grid industry.
There is no doubt that applying the smart algorithms and applications of the Internet industry to the smart grid could generate a host of new ways of doing business. On the utility operations side, smart meters and distribution automation systems can be data-mined to optimize the flow of power or to predict when equipment is most likely to fail. On the customer end, behavioral data and market analysis can be applied to entice more people into energy efficiency programs, or to help them choose which energy-efficient appliances to buy.
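As one hedged example of what "mining" meter data might look like in practice (not any particular vendor's method), the short Python sketch below flags hours where a simulated smart-meter reading drifts far from its recent average, using a rolling z-score in pandas. The readings are simulated stand-ins for real interval data.

# Hedged sketch: flag unusual smart-meter readings with a rolling z-score in pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = pd.date_range("2013-03-01", periods=24 * 14, freq="h")
kwh = pd.Series(1.2 + 0.4 * np.sin(np.arange(len(hours)) * 2 * np.pi / 24)
                + rng.normal(0, 0.05, len(hours)), index=hours)
kwh.iloc[200] += 3.0  # inject one suspicious spike

rolling_mean = kwh.rolling(48).mean()
rolling_std = kwh.rolling(48).std()
z = (kwh - rolling_mean) / rolling_std

print(kwh[z.abs() > 4])  # hours that look abnormal relative to the last two days

A utility would run something of this flavor, at far larger scale, to spot failing equipment, theft or metering errors.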
A host of IT giants are already involved in smart grid data analytics, including Accenture, Capgemini, HP, IBM, Microsoft, Oracle, SAIC, SAP and Siemens. Smaller, newer entrants include OPOWER, OSIsoft, Telvent, Ecologic Analytics and eMeter.
Utilities are also seeking help in upgrading customer relationship management to handle the shift from monthly power bills to daily or hourly interactions via smart meters. As for concerns over home energy data security and privacy, Pike predicts that smart grid IT players such as Cisco, IBM, Microsoft and Oracle will play an important role.
reference:
http://gigaom.com/2011/08/15/big-data-meets-the-smart-grid/
Sports fraud detection using data mining techniques
People in the States love sports, and they hate fraudulent activity in sports. It is believed that too much fraud will cost a sport popularity and audiences and eventually destroy it. Nevertheless, for various reasons, fraud in sports has always existed. It is therefore essential to find ways to detect fraudulent activity in sports.
Before introducing sports fraud detection, I'd like to talk a little bit about sports fraud itself. Based on observation, there are mainly three categories of fraudulent activity in sports: poor player performance, a pattern of unusual calls from the referee, and lopsided wagering. The first two are basically attempts to manipulate the betting line, and several cases have been reported by the media. A recent example was in the summer of 2007, when NBA referee Tim Donaghy was investigated and convicted for compromising basketball games to pay off gambling debts. Lopsided wagering can be used as an indicator of a compromised game; this type of wagering could involve betting in excess of what is normally expected or betting heavily against the favorite.
As you might know, Las Vegas Sports Consultants Inc. (LVSC), which sets betting lines for 90% of Las Vegas casinos, is one of the organizations that actively looks for fraudulent sports activity. LVSC statistically analyzes both betting lines and player performance, looking for any unusual activity. Player performance is judged on a letter-grade scale (i.e., A-F) and takes into account variables such as weather, luck and player health. Taken together, games are rated on a 15-point scale of seriousness: a game rated at 4 or 5 points may undergo an in-house review, while 8-9 point games will involve contact with the responsible league. Leagues are similarly eager to use the services of LVSC to maintain game honesty; LVSC counts several NCAA conferences, the NBA, the NFL, and the NHL among its clients.
In practice, Las Vegas Sports Consultants is not the only gambling institution with an interest in honest and fair sporting events; offshore betting operations are starting to fill this role as well. One popular offshore gambling site, Betfair.com, has signed an agreement with the Union of European Football Associations (UEFA) to help monitor games for match-fixing, unusual results, or suspicious activity.
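LVSC's actual models are not public; purely as a hedged illustration of the "lopsided wagering" signal, the Python sketch below flags games where the share of money on the underdog is far above its historical norm, using invented numbers.

# Hedged sketch: flag games with unusually lopsided wagering, using invented data.
# Idea: compare each game's share of money on the underdog to the historical norm.
import statistics

# Invented history: fraction of total money bet on the underdog in past games.
historical_underdog_share = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 0.32, 0.27, 0.34, 0.30]
mean = statistics.mean(historical_underdog_share)
stdev = statistics.stdev(historical_underdog_share)

todays_games = {"Game 1": 0.33, "Game 2": 0.62, "Game 3": 0.29}  # invented shares

for game, share in todays_games.items():
    z = (share - mean) / stdev
    status = "REVIEW" if abs(z) > 3 else "ok"
    print(f"{game}: underdog share {share:.2f}, z-score {z:+.1f} -> {status}")

A flagged game would then go to the kind of in-house review and league contact described above, not straight to an accusation.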
Big Data and Film
After cloud computing changed people's thinking in business, Big Data came along and changed how businesses operate and analyze, increasing efficiency and mobility. Many different industries benefit from Big Data, and the entertainment business has also taken it up. Analysts can use the number of nominations and wins at different award shows, together with information from betting websites, to predict the major award winners at a specific ceremony; music executives analyze data to learn customers' listening habits and musical inclinations, and to decide where, and with what kinds of musicians, to put on shows; and movie studios, distributors and other film organizations can use Big Data to make decisions about promotion methods, release dates and release locations so that their films make a profit.
Now let's look at ten movies that can help us get familiar with Big Data in an entertaining way.
V for Vendetta (2005 - James McTeigue) In Big Data, V stands for more than Vendetta: the "V-Trinity" stands for Velocity, Volume and Variety. Velocity means real-time processing, Volume means extracting useful and sufficient information from a very large amount of data, and Variety means using and relating many different kinds of data to help make the final decisions.
The Fast and the Furious (2001 - Rob Cohen) Keep your eyes on your data and make it "speak out" useful information. You can use Big Data to analyze your business and predict which decision could make a difference and which could bankrupt you; you learn, at high speed, whether a course of action will fail or succeed. With a data-driven culture, you find out quickly whether you are dying or triumphing. Fail fast, or you are going to get furious!
The Gold Rush (1925 - Charlie Chaplin) Data is like a new gold mine, and a large number of companies want to exploit its potential to extract more and more information and make money. But building a data-driven culture inside an organization is very difficult, much like Chaplin's hard trek through Alaska. Organizations that want an easier and happier revolution should know how to avoid the fatal mistakes of the Big Data rush.
Up (2009 - Pete Docter, Bob Peterson) Up is a touching movie, and the cloud scenes are great fun; fittingly, the volume of Big Data is handled with elastic cloud infrastructure. Big Data with the map-reduce paradigm lets these information technology problems be solved on different cloud infrastructures.
The Elephant Man (1980 - David Lynch) "Hadoop" (named after the toy elephant of creator Doug Cutting's son) is the yellow elephant in the Big Data room. Hadoop began as an open-source project inspired by Google's papers, and it is now a cornerstone of the Big Data foundations.
Titanic (1997 - James Cameron) Titanic shows how a decision made without enough analysis of the uncertainty can come to a sad end. Big Data can give you useful information that you could not see in your data before. With Big Data, you can see the "iceberg" under the "water" and gain a new view of your data and your decisions.
Minority Report (2002 - Steven Spielberg) In the pre-crime department, Anderton stops criminals before they kill their victims by predicting the future. Predictive analytics, using data analysis to predict what will happen, is the "killer application" of Big Data.
No Country for Old Men (2007 - E. & J. Coen) Big Data is a brand-new skill, and the old database "men" need to adjust to the new technology. Big Data also means many kinds of data and a large number of sources, from which decision makers can uncover the potential hidden in "big" data.
Big Fish (2003 - Tim Burton) Big Fish is a movie with a connection to Auburn University. Big Data so far is like Ed Bloom's astonishing story told from his deathbed: sometimes you cannot clearly decide what is reality and what is a fairy tale, because the fairy tale may become reality in the future.
Black Swan (2010 - Darren Aronofsky) A Black Swan is not only a role but also a theory telling us that rare events are hard to predict and can have a significant influence on the final results. Big Data can help you see the rare events in your data; by analyzing the data, you may anticipate the "Black Swan" and have a clearer idea of what to do in the future.
Source:
http://bigdata-doctor.com/big-data-explained-in-10-movies/
Tutorial: Naive Bayes Classification Algorithm
Naive Bayes is a simple classification algorithm. It is based on treating the effect of each criterion on the result as a probability: all of the training documents and any newly added documents are used to calculate the probability that each newly added term affects each category (Adsiz, 2006). By Bayes' theorem (Han & Kamber, 2001),
P(c_j | d) = P(c_j) P(d | c_j) / P(d),
where P(c_j) is the prior probability of category c_j and d = (t_1, ..., t_M) is the document, represented by its terms, that is going to be classified.
Because there are many features in the dataset, it would be very difficult to calculate P(d | c_j) directly. Therefore, in this algorithm, the features in a document are treated as independent (Adsiz, 2006):
P(d | c_j) = ∏ P(t_i | c_j), i = 1, ..., M,
where t_i is the i-th term in the document. P(c_j | d), the probability of a document being in category c_j, is then proportional to
P(c_j | d) ∝ P(c_j) ∏ P(t_i | c_j), i = 1, ..., M,
where P(t_i | c_j) is the conditional probability of term t_i occurring in category c_j; it is also an indicator of how much t_i contributes to the category. In this method, we try to find the most appropriate category for the document. For this purpose, the best approach is the maximum a posteriori (MAP) category c_map (Manning, Raghavan & Schütze, 2008):
c_map = argmax_{c_j} P(c_j | d) = argmax_{c_j} P(c_j) ∏ P(t_i | c_j).
We do not know the exact values of the parameters P(c_j) and P(t_i | c_j). However, using maximum likelihood estimation (MLE), we can estimate them. Let P~(c_j) be the estimate of P(c_j), and P~(t_i | c_j) the estimate of P(t_i | c_j). Then
P~(c_j) = N_j / N,
where N is the number of all documents and N_j is the number of documents in category c_j. And
P~(t_i | c_j) = (1 + N_ij) / (M + ∑_i N_ij), i = 1, ..., M,
where N_ij is the number of documents belonging to category c_j that contain the i-th term t_i.
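To tie the formulas together, here is a small from-scratch Python sketch that estimates P~(c_j) and P~(t_i | c_j) from a toy labeled corpus and classifies a new document by the maximum a posteriori rule above. It follows the smoothed estimates given in this tutorial, with logs to avoid underflow; the training sentences are invented.

# Small from-scratch Naive Bayes sketch following the estimates above.
# Toy training data: (document, category) pairs; everything here is invented.
import math
from collections import defaultdict

training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

vocab = {t for doc, _ in training for t in doc.split()}
M = len(vocab)                                   # number of distinct terms
N = len(training)                                # number of training documents

N_j = defaultdict(int)                           # documents per category c_j
N_ij = defaultdict(lambda: defaultdict(int))     # documents in c_j containing term t_i
for doc, c in training:
    N_j[c] += 1
    for t in set(doc.split()):
        N_ij[c][t] += 1

def posterior_score(doc, c):
    # log of P~(c_j) * prod_i P~(t_i | c_j), with the smoothed estimate above
    score = math.log(N_j[c] / N)
    denom = M + sum(N_ij[c].values())
    for t in doc.split():
        score += math.log((1 + N_ij[c][t]) / denom)
    return score

new_doc = "buy cheap pills"
best = max(N_j, key=lambda c: posterior_score(new_doc, c))
print(best)  # classified as "spam" for this toy example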
1. Adsiz, A. (2006). Text Mining (dissertation). Ahmet Yesevi University.
2. Han, J. & Kamber, M. (2001). Data Mining. Morgan Kaufmann Publishers, San Francisco, CA.
3. Manning, C.D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Tutorial: Topsy
There are plenty of tools out there to analyze social media
topics and trends. One that Chris and I have found especially helpful in
analyzing these large amounts of data is called “Topsy”. He is probably better
at understanding all of the cool things it can do, but I will give a short
tutorial on the basic information it can provide.
**To be able to try this tutorial, you will need to create a
trial account. It is free and it lasts for two weeks, so you’ll have plenty of
time to be able to play around with it.
Once you have created your account and have logged in, you will
see this as the main screen:
The first thing you will want to do is type in the terms
that you want to search for in the bar at the top. After each keyword or
phrase, hit enter. In this example, I will search three phrases: “data
analysis”, “big data”, and “big data analytics”. To compare the three terms, I
need to be sure that the check box beside each phrase (under the search bar) is
checked. You will then be looking at your Dashboard. This feature gives an
overview of the information Topsy has collected. The timeline shown is based on
the last seven days, but you can choose a specific date range if you would
like. In this case, my Dashboard looks like this:
On the Dashboard, you are able to see Tweet activity over
time. It is easy to see that the phrase “big data” is a lot more prevalent on
Twitter than the other two phrases that were searched. You are also presented with Top Tweets,
Top Links, and Top Media.
If you click on the Geography tab at the top, you are able
to see where the Tweets are coming from. Topsy is gathering most of its Tweets (at
least about these topics) from Twitter users in the US. There are almost 25,000
tweets from the US, and the next closest is the UK with only about 4,000
tweets. You can see the breakdown of the tweets from around the world below:
You can even click on the United States link, and Topsy
breaks down the tweets by state. Pretty amazing! Alabama doesn’t have much to
say about these topics. There are only 43 tweets with these phrases from
Alabama, whereas California has over 3,000. Big data must be a hot topic in
Silicon Valley!
Now I am going to go back to the Dashboard. Right around March 29, I see a huge peak in the frequency of the term "big data". Let's find out why!
If I click on this peak, it leads me to the Activity tab, where I am able to see a list of the Top Tweets about "big data" for March 29 at 23:00. It looks like this:
I am able to click on any links that may be connected to
those tweets to read more about what was so popular about “big data” on this
day. One of the main tweets that seems to have gotten the most action at this
time was about how Doctors can use big data to improve cancer treatments. Since
a link is attached to this tweet, I am able to check it out myself!
Like I said, Topsy is a great tool for analyzing the huge
amounts of data found in the social media world. These are just a couple of the
things that the program can do. I encourage you to check it out and find out
more things that this tool can be used for!
PS. I focused mainly on Twitter data in this example, but it
is my understanding that you can search Facebook, Tumblr, and Pinterest as
well.
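Topsy itself is a point-and-click tool, but if you export a set of search results to a CSV you can reproduce the frequency-over-time view yourself. The sketch below assumes a file named topsy_export.csv with "created_at" and "text" columns; both the filename and the column names are hypothetical, not Topsy's actual export format.

# Hedged sketch: count tweets per day from a hypothetical CSV export of search results.
import pandas as pd

tweets = pd.read_csv("topsy_export.csv", parse_dates=["created_at"])

# Tweets per day mentioning "big data" (case-insensitive).
mask = tweets["text"].str.contains("big data", case=False, na=False)
per_day = tweets.loc[mask].set_index("created_at").resample("D").size()

print(per_day)          # daily counts, like the Dashboard timeline
per_day.plot()          # requires matplotlib; a quick frequency-over-time chart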