Tuesday, April 2, 2013

Siri and Sentiment Analysis

Sentiment analysis and artificial intelligence: Siri, should I open this email?
Image from parent article


  
     Siri, Apple's iPhone/iPad "genie", lets you use your voice to send messages, schedule meetings, place phone calls, write emails, and more. According to Apple, Siri not only understands what you say, it's smart enough to know what you mean. So what if Siri were used in partnership with sentiment analysis to determine the emotional tone of text messages, emails, or social media posts? If Siri could tell you that your message might be construed as carrying an unintended negative undertone, or if Siri could serve as a questioning firewall before you publish a negative post that could go viral, would it help protect you from potentially severe reputational damage? Rado Kotorov wonders whether this application of sentiment analysis would help make the world a better place with less conflict and argument, or whether it would just leave piles of unread messages in your inbox.
     So how can a machine predict the emotion of a message? This is a challenging question, given that even humans are not entirely reliable at discerning other humans' emotions. Sentences can contain subtleties that vary across phrases and languages, and short answers can also pose problems for sentiment analysis. Sentiment analysis typically works by "scoring" a phrase according to the number of negative or positive words it contains; that leaves little room for detecting phrases whose intended meaning is the opposite of the literal words themselves (sarcasm).
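To make the "scoring" idea concrete, here is a minimal sketch of lexicon-based sentiment scoring in Python. The word lists and examples are made up for illustration; real systems use much larger lexicons and try to handle negation, intensifiers, and context.

```python
# Minimal lexicon-based sentiment scorer (toy example; word lists are illustrative only).
POSITIVE = {"great", "love", "thanks", "happy", "good"}
NEGATIVE = {"hate", "awful", "angry", "bad", "never"}

def sentiment_score(text: str) -> int:
    """Return (#positive words - #negative words); >0 positive, <0 negative, 0 neutral."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this, thanks!"))        #  2 -> scored positive
print(sentiment_score("I will never work with you"))  # -1 -> scored negative
print(sentiment_score("Oh great, another meeting"))   #  1 -> sarcasm still scores as positive
```

The last example shows exactly the weakness discussed above: a purely word-count-based score has no way to recognize sarcasm.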
     In my opinion, if this technology were to be implemented, sentiment analysis would need to advance to the point of determining nuances in colloquial language and the nature behind how humans communicate in order to be effective. Parent article here.

Facebook and Data mining

If you haven’t fled Facebook for Google+ or abandoned social networks entirely, you probably–like me–have a lot invested in the platform. A new feature is in beta on Facebook: Graph Search. If you get through the waiting list to try it out, you’ll find lots of options for targeted searches centered on your social network. Graph Search works by linking together terms and restrictions to allow for very specific searches within the network: you can look for images from friends based on a common location or subject, or find everyone in your social network who went to the same university and is a fan of Glee. Is it useful? The possibilities for networking–from finding local friends who share a passion for running to gathering info on a potential new campus to making connections at a company–are immediately clear. But it’s also a powerful (and perhaps alarming) data mining tool that puts front and center just how much data some of us have committed to this social network already.
Those with access to the new search mechanism have already created a stir with sites such as Tom Scott’s “Actual Facebook Graph Searches,” which includes several juxtapositions of targeted search queries that could reveal everything from personally embarrassing information to illegal acts within certain countries. Of course, targeted Google searches or just a quick browse of an ill-considered profile can be equally revealing, but there’s an alarming efficiency to this new method of data mining within the social network. The availability of all this data is definitely going to lead to some tense Institutional Review Board debates, as it offers an easy way for all of us to see some of the incredible marketing and interest data that Facebook has been amassing on its users. It could certainly be a fertile ground for social research–but are all Facebook users really clear on how much information they’re sharing?
The introduction of Graph Search makes this an important time to revisit privacy settings: EFF has broken down some of the new implications. Check out Brian’s essential steps to checking Facebook privacy to get started. Image searches can be particularly hard to control, as they pull pictures from everyone’s albums, and your friends might not have the same standards for privacy as you do. This next iteration of social search is also another opportunity to talk with students about their digital identities and privacy choices. I’ll definitely be taking a look at its ramifications in my digital communication learning community course this semester, as it shows how easy it is to pull personal information out of the noise of social media.


Monday, April 1, 2013

Big Data vs. Big Applications of Small Data


Big data is awesome and has done a lot of great things for businesses, as we have seen in many of the previous blog posts. We have become obsessed with big data and the idea that it will translate into profitable businesses. One of the reasons big data has become so popular is that it promises to bring a new age of scientific decision-making and business reasoning.

Despite all of these promises, it is still unclear how to move from vision to reality. However, there is another type of data that is easier to deal with but can still give a lot of useful information: small data applied in big ways. Two different businesses have capitalized on this idea and profited greatly from it. One is Burberry, a luxury brand, and the other is Caesars, which holds stakes in hotels, spas, and casinos.

Burberry started using data in its business simply by creating a better experience for its customers. It installed screens connecting shops to the head office, put up audio-visual customer information displays, and gave the staff iPads. This led the company to use social media to engage its audience and to cross-reference the information gathered from in-store purchases with what the staff saw on their iPads. This is all small data, but it is used in a big way.

Caesars Entertainment started with big data analytics long before Burberry started using small data. However, it was unsatisfied with the lack of small details in the big data it had previously acquired. It decided to revamp the customer experience and followed customers from their initial search until they appeared on a Caesars property. Caesars has since expanded this so that customers on a losing streak receive a personalized voucher. The company has learned when it is best to extend credit to a customer and when it is best not to do so.

Burberry and Caesars, coming from different ends of the data spectrum, both ended up in the same place: focusing on small data in order to make sure the customer comes first. This has led to profit increases for both companies, and Burberry became the fastest-growing luxury brand in 2012. So even though this class is focused on big data, we must not forget that small data applied in a big way can be just as valuable.

Source:
http://www.forbes.com/fdc/welcome_mjx.shtml

Big Data's biggest challenge? Video Games..

I'll start with a few facts to convince you of the sheer volume of information held in gaming.

1) Over the last decade the video game industry has grown from 200 million consumers to 2 billion consumers.
2) "Battlefield 3" generates 1 TB of data daily (this data includes kills, deaths, shots, explosions, etc.).
3) A simpler game such as "The Simpsons: Tapped Out" generates 150 GB daily and 4.5 TB monthly.
4) In a typical month, over two and a half billion sessions are held across the entire platform of games (roughly fifty billion minutes of game play).

The reason the gaming industry finds big data so interesting is its evolution from paid-to-play games to free-to-play games. Many gaming companies are changing their business model from a one-time upfront payment to free-to-play, in the hope that you will spend money on in-game help (i.e., better weapons, new characters, different maps, etc.). They find that the relationship between prices and purchases, or between limitations and purchases, is vital to maximizing their profits. Rajat Taneja states that "the right data at the right time is more valuable than bigger data sets." Since so much data is generated daily, descriptive data from a single day may be more telling than descriptive data from the whole week; at the same time, by looking at only a small slice of data you may overlook an opportunity to enhance the consumer's gaming experience.
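As a rough sketch of the kind of "right data at the right time" analysis described above, the snippet below rolls a made-up log of in-game purchase offers up into daily metrics with pandas; the column names and values are hypothetical, not from any real game.

```python
import pandas as pd

# Hypothetical event log: one row per purchase offer shown to a player.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-03-30 10:05", "2013-03-30 21:40", "2013-03-31 09:12",
        "2013-03-31 18:30", "2013-04-01 20:15", "2013-04-01 22:50",
    ]),
    "price_usd": [0.99, 4.99, 0.99, 1.99, 4.99, 0.99],
    "purchased": [1, 0, 1, 1, 0, 1],   # 1 = player bought the item
})
events["revenue"] = events["price_usd"] * events["purchased"]

# Roll the raw events up into per-day descriptive metrics.
daily = (events
         .assign(date=events["timestamp"].dt.date)
         .groupby("date")
         .agg(offers=("purchased", "size"),
              conversion_rate=("purchased", "mean"),
              revenue=("revenue", "sum")))
print(daily)
```

Comparing a single day's conversion rate against the weekly picture is exactly the kind of question the paragraph above is pointing at.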

Reference: http://www.youtube.com/watch?v=ZK_PXlbvOfM

Music and Big Data

Data is everywhere, and most of the time we don't even realize it. Digital stores like iTunes can provide analysts in the music industry with insights into listening habits, preferences, and a whole host of other data that can be used to understand the music consumer better. Information like this is being used to drive the type of songs and the type of branding that go into channels such as online radio.
For starters, there's a company called The Echo Nest that has spent over half a decade analyzing more than 30 million songs from over 2 million artists. This information is being used to populate data points that are driving a whole new era of music analytics, dubbed 'fanalytics'. Companies like The Echo Nest are even analyzing the tempo, pitch, and other aspects of songs and relating them to cultural habits and other behavioral and consumption data to help better understand what sells and what does not.
Like many other industries before it, the music industry has turned to data analytics to maximize its avenues of profit. The technology being used as a 'recommendation engine' for music is very similar to what you see when you buy books from Amazon.com: recommendations are made based on similar books or similar customer profiles, giving the buyer more avenues to explore.
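As an illustration of how such a recommendation engine can work, here is a minimal item-based collaborative filtering sketch: songs that tend to be played by the same listeners get a high cosine similarity and are recommended together. The play-count matrix is invented for the example; real systems like The Echo Nest's combine far more signals.

```python
import numpy as np

# Hypothetical play counts: rows = listeners, columns = songs.
songs = ["Song A", "Song B", "Song C", "Song D"]
plays = np.array([
    [12,  0,  5,  0],
    [ 9,  1,  7,  0],
    [ 0,  8,  0,  6],
    [ 1, 10,  0,  9],
], dtype=float)

# Cosine similarity between song columns.
norms = np.linalg.norm(plays, axis=0)
similarity = (plays.T @ plays) / np.outer(norms, norms)

def recommend(song: str, top_n: int = 2) -> list:
    """Return the songs most similar to `song`, excluding itself."""
    i = songs.index(song)
    ranked = np.argsort(similarity[i])[::-1]
    return [songs[j] for j in ranked if j != i][:top_n]

print(recommend("Song A"))  # listeners of A also play C, so C ranks first
```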
One way Big Data tools can help record labels better understand their local customers is with live concerts. Big Data tools can help organizers decide which cities to hold shows in, which artists to promote, what songs are trending in those cities, which avenues to use to promote the concerts, and much more. This may sound like Big Data is just helping record labels fill in their yearly planner, but if done correctly it can help artists know their audience better, plan their shows better, and ultimately make more money. Lady Gaga's gross profit from concerts alone was over $225 million in 2011. Her manager admits to being a big fan of Big Data, as it helps them plan the stage and set list based on the mood of the city they are playing in, ensuring a better turnout and response. Music artists are already using 'fanalytics' to get the next level of understanding of their listeners, and this information is being used in a whole range of activities including merchandising, concert planning, and even songwriting.

Sources: http://venturebeat.com/2012/02/17/music-hack-day/
http://www.webpronews.com/the-echo-nest-where-big-data-is-taking-the-music-industry-2012-04

How businesses, criminals, and governments track what you're doing

        There have been a lot of cases in the news about identity theft, which has made many people wary of criminal activity. What people aren't realizing is that there is also a bigger privacy issue growing, not with criminals, but with companies like Google, Amazon, and Facebook. At the moment, there aren't many laws or regulations that control what data these businesses can collect. Predictive analytics on big data is a tool that is giving governments and businesses the capability to be even more intrusive.

        The data being used can be gathered using point-of-sale systems, mobile devices, cameras, microphones, internet searches, and online tracking technologies. One example is the detailed transactions that are saved by retailers such as Wal-Mart and Target. Another example is "likes" and "shares" on Facebook. Even your searches in Google are saved. Statistics, modeling, and data mining are only some of the tools being used to analyze the huge amount of data that people in the US give off every single day. Targeting customers is the primary objective of this analysis.

      Target is well known for targeting customers, especially new moms; a story about this was discussed in an earlier blog post. Another application is dynamic pricing: prices can be changed based on an algorithm that estimates customers' willingness to pay. This is especially useful and accurate for online retailers, who have more personal information, purchasing history, and data on how many times a person has looked at a product. This gives retailers a much better idea of how much customers are willing to pay for certain items.
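Here is a toy sketch of the dynamic pricing idea: estimate the probability of purchase at each candidate price and pick the price that maximizes expected revenue. The logistic demand curve and its coefficients are invented for illustration and are not taken from any retailer's actual model.

```python
import numpy as np

def purchase_probability(price, base=6.0, sensitivity=0.12):
    """Toy logistic demand curve: probability of buying falls as price rises.
    `base` and `sensitivity` would normally be fit from a shopper's history."""
    return 1.0 / (1.0 + np.exp(-(base - sensitivity * price)))

prices = np.arange(20.0, 80.0, 1.0)              # candidate prices in dollars
expected_revenue = prices * purchase_probability(prices)

best = prices[np.argmax(expected_revenue)]
print(f"price that maximizes expected revenue: ${best:.2f}")
```

With more personal data, the retailer can fit a separate demand curve per customer, which is exactly what makes this both effective and unsettling.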

     Some examples of patterns and correlations discovered by big data are:

  •  Facebook "likes" revealing political and religious views, drug use, marital status, and sexual orientation.
  •  Blue Cross/Blue Shield buying shopping data: if a person buys plus-size clothing, the plan could flag them for potential obesity and, in turn, higher healthcare costs.
  •  President Obama's 2012 campaign using datasets to identify Republican-leaning voters who might be persuaded by specific issues.

     What makes all of this so interesting is that there aren't very many regulations, so who knows where this could go if governments and businesses remain unregulated in what they can and cannot collect.

The potential career link between big data and Wall Street




It has been a long-known secret that Wall Street hires from fields like rocket science, looking for skills in modeling, advanced mathematics, and analysis. But did you know that there is now a job title Wall Street uses for these data- and mathematics-driven analysts, and that it is directly related to big data? They're called quants, which is short for quantitative analyst.

I thought that, since we are getting close to the end of the semester, I would share a bit about this field, as it uses big data in a very financially rewarding career. It turns out that most large Wall Street investment firms employ quants as analysts who build models for understanding markets, securities, and instruments based on large data sets of past market performance. These analysts have become so integral to Wall Street that graduate degree programs have been created specializing in training quants. Here are the links to two of these programs: the first is at Southern New Hampshire University (http://www.snhu.edu/online-degrees/graduate-degrees/MBA-online/quantitative-analysis.asp) and the second is at UC Berkeley (http://extension.berkeley.edu/spos/quantitative.html).

If you read the Wikipedia page that describes quants, it has marked similarities to many of the things that describe industrial engineers, only it focuses solely on using those techniques to build predictive and descriptive models involving different types of finance. Quants use tools like Monte Carlo simulation, stochastic modeling, and time series analysis to inform investors and portfolio managers on different areas of finance.
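To give a flavor of the Monte Carlo side of that toolbox, here is a small sketch that simulates one-year portfolio outcomes under a geometric Brownian motion assumption and reads off a 5% value-at-risk figure. The drift, volatility, and starting value are arbitrary example numbers, not a real calibration.

```python
import numpy as np

rng = np.random.default_rng(42)

start_value = 1_000_000.0   # hypothetical portfolio value in dollars
mu, sigma = 0.07, 0.20      # assumed annual drift and volatility
days, n_paths = 252, 100_000

# Geometric Brownian motion: simulate daily log-returns and compound them.
dt = 1.0 / days
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal((n_paths, days))
final_values = start_value * np.exp(log_returns.sum(axis=1))

var_5 = start_value - np.percentile(final_values, 5)
print(f"mean final value: ${final_values.mean():,.0f}")
print(f"5% one-year value-at-risk: ${var_5:,.0f}")
```

A working quant would of course use far richer models and market data; the point is only to show the style of simulation-driven reasoning.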

Gavin posted last week about High Frequency Trading. The foundation of High Frequency Trading is models and algorithms made by Quants. I know that there are undergraduates in the class that are considering an MBA and taking their IE knowledge to the business world, so I thought that I would share this in case anybody was interested in a career using data mining in business. 

Data mining could predict heart attack risks!


A team of researchers has used data mining techniques to find subtle changes in electrical activity in the heart that can be used to predict potentially fatal heart attacks.

Researchers from the University of Michigan, MIT, Harvard Medical School and Brigham and Women’s Hospital in Boston sifted through 24-hour electrocardiograms (which measure the electrical activity in the heart) from 4,557 heart-attack patients to find errant patterns that until now had been dismissed as noise or were undetectable.
They discovered several of these subtle markers of heart damage that could help doctors identify which heart attack patients are at a high risk of dying soon. Electrocardiograms (ECGs) are already used to monitor heart attack patients, but doctors tend to look at the data in snapshots rather than analyze the lengthy recordings.
The team developed ways to scan huge volumes of data to find slight abnormalities — computational biomarkers — that indicate defects in the heart muscle and nervous system. These included looking for subtle variability in the shape of apparently normal-looking heartbeats over time; specific sequences of changes in heart rate; and a comparison of a patient’s long-term ECG signal with those of other patients with similar histories.
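As a rough, simplified illustration of the first kind of biomarker mentioned above (beat-to-beat variability in the shape of apparently normal heartbeats), the snippet below compares each beat against the average beat shape and measures how much the shapes drift over time. This is not the researchers' actual algorithm, and the signal here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for segmented ECG beats: 200 beats, 80 samples each.
template = np.sin(np.linspace(0, np.pi, 80))                 # idealized beat shape
beats = template + 0.02 * rng.standard_normal((200, 80))
beats[100:] += 0.05 * rng.standard_normal((100, 80))         # later beats drift more

def morphologic_variability(beats: np.ndarray) -> float:
    """Mean Euclidean distance of each beat from the average beat shape."""
    mean_beat = beats.mean(axis=0)
    return float(np.linalg.norm(beats - mean_beat, axis=1).mean())

print("first half :", round(morphologic_variability(beats[:100]), 3))
print("second half:", round(morphologic_variability(beats[100:]), 3))  # higher -> shapes vary more
```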
They found that looking for these particular biomarkers in addition to using the traditional assessment tools helped to predict 50 percent more deaths. The best thing is that the data is already routinely collected, so implementing the system would not be costly.
Around a million Americans have heart attacks each year, and more than a quarter of those who survive the initial attack die within a year. Current techniques miss around 70 percent of the patients who are at high risk of complications, according to Zeeshan Syed, assistant professor in the University of Michigan Department of Electrical Engineering.

Syed explains: “There’s information buried in the noise, and it’s almost invisible because of the sheer volume of the data. But by using sophisticated computational techniques, we can separate what is truly noise from what is actually abnormal behavior that tells us how unstable the heart is.”
Doctors tend to look out for several factors in heart attack patients, including blood test results, echocardiograms, medical history and the patient’s overall health. Those identified as having a high risk of sudden death due to irregular heart rhythms can be given medication or implantable defibrillators, which can shock the heart back into its regular rhythm.
However, it’s hard to work out who needs these treatments before it’s too late — most people who die in this manner aren’t identified as candidates for implantable defibrillators.
MIT professor John Guttag explains: “We’re reaching a point in medicine where our ability to collect data has far outstripped our ability to analyze or digest it. You can’t ask a physician to look at 72 hours’ worth of ECG data, so people have focused on the things you can learn by looking at tiny pieces of it.”

Sunday, March 31, 2013

Customer network value


In marketing, each customer has a lifetime value. This value measures a customer's potential purchasing power, that is, the profit that could be obtained from that customer. Now, as social networks develop, customers also have a network value, which is a new component of customer value.

Network value measures one customer's influence on other customers. The most common way to obtain this value is data mining.

The figure below shows a social network connected by iPhones, where different colors represent different iPhone models. The point of the figure is that more and more people are being connected into the social network.


So, using social networks to sell products or conduct market surveys could be a new and effective way of marketing. The first step is to find the customers with the highest network value, and the method is data mining.

To perform data mining, data collection is necessary. Usually, companies have their own social network pages. For example, Fractal Design, a computer case manufacturer, has a Facebook page and a Twitter account. Customers can connect to these pages as fans or followers, and the customer information can then be gathered by the company. The company can then look for customers who:
   have many connections, and
   always talk about its products.

Customers with these features are most likely opinion leaders, and companies should focus more on this kind of customer. For instance, if they find a customer who regularly reviews their computer cases on his Facebook page and uses them to build systems, they could send him some new products for free; he then becomes a node of advertisement on the social network.
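A minimal sketch of how the fan with the highest network value could be found from the collected connection data, using degree centrality as a stand-in for network value (a real analysis would also weigh how often each fan talks about the products); the edge list is invented.

```python
import networkx as nx

# Hypothetical "who is connected to whom" data pulled from the fan page.
edges = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("alice", "erin"), ("bob", "carol"), ("dave", "frank"),
]
G = nx.Graph(edges)

# Degree centrality: fraction of other fans each fan is connected to.
centrality = nx.degree_centrality(G)
top_fan = max(centrality, key=centrality.get)

print(sorted(centrality.items(), key=lambda kv: -kv[1]))
print("highest network value:", top_fan)   # 'alice' in this toy network
```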


Ref: http://predictive-marketing.com/index.php/tag/social-network-analysis/

Improving soldiers' performance using Big Data



If information about soldiers deployed on the battlefield could be easily acquired, how much would operations be improved? Equivital, a UK-based company, has developed a wearable computer, called Black Ghost, that can sense critical information about soldiers, such as health status and location, and relay it back to headquarters. By monitoring heart rate, respiration, or GPS data, a commander can see whether a soldier's performance is deteriorating over a certain period or whether he or she has crossed the border.
The LifeMonitor, together with Black Ghost, provides data management and visualization. The big driver of this system is the ability to gather and centralize performance data from multiple soldiers over time. This allows a better understanding of soldier and squad performance and of how to improve it through optimized methods. It also helps soldiers quickly identify areas in the field that could leave them vulnerable to attack.
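A simplified sketch of the kind of monitoring described above: keep a rolling window of a soldier's heart rate stream and flag sustained deterioration relative to that soldier's own baseline. The thresholds and readings are invented for illustration, not taken from Equivital's system.

```python
from collections import deque

def deterioration_alerts(heart_rates, baseline=75.0, window=5, threshold=1.25):
    """Yield the sample index whenever the rolling-average heart rate
    stays above `threshold` x the soldier's baseline."""
    recent = deque(maxlen=window)
    for i, hr in enumerate(heart_rates):
        recent.append(hr)
        if len(recent) == window and sum(recent) / window > threshold * baseline:
            yield i

# Hypothetical one-reading-per-minute stream for a single soldier.
stream = [78, 80, 82, 85, 90, 96, 101, 104, 108, 110, 107, 99, 92]
print(list(deterioration_alerts(stream)))   # indices where sustained elevation is flagged
```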


Reference: http://www.wired.co.uk/news/archive/2013-01/21/equivital-black-ghost

Crunching Big Data with Google Big Query





Ryan Boyd, a developer advocate at Google who focuses on Google BigQuery, presents the first part of this video; in his five years at Google he helped build the Google Apps ISV ecosystem. Tomer Shiran, who is the director of product management at MapR and a founding member of Apache Drill, presents the second part of the video.

Developers have to deal with many different kinds of data and enormous volumes of it. Without good analysis software and useful analysis methods, they spend a lot of time collecting huge amounts of data and then throwing away the "worthless" parts, even though, most of the time, that discarded data has potential value of its own. Google has a deep knowledge of big data, given that every minute countless users are using Google products such as YouTube, Google Search, Google+, and Gmail. With these amounts of data, Google has begun to offer APIs and other technologies that let developers focus on their own fields.

Google BigQuery is backed by Dremel, a Google-internal technology for big data analysis, and Apache Drill, as Wikipedia puts it: "Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Currently, Drill is incubating at Apache." Apache Drill lets users query terabytes of data in seconds, supports the Protocol Buffers, Avro, and JSON data formats, and can use Hadoop and HBase as data sources.

MapR Technologies offers an open, enterprise-grade distribution for Hadoop that is easy, dependable, and fast to use, built on open source with standards-based extensions. MapR is deployed at thousands of companies, from small Internet startups to the world's largest enterprises. MapR customers analyze massive amounts of data, including hundreds of billions of events daily, data from ninety percent of the world's Internet population monthly, and data from one trillion dollars in retail purchases annually. MapR has partnered with Google to provide Hadoop on Google Compute Engine. The Drill execution engine has two layers: an operator layer, which is serialization-aware and processes individual records, and an execution layer, which is not serialization-aware, processes batches of records, and is responsible for communication, dependencies, and fault tolerance. MapR positions itself as providing the best big data processing capabilities and as a leading Hadoop innovator.
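As an illustration of the kind of interactive, SQL-style querying BigQuery exposes, here is a minimal sketch using the google-cloud-bigquery Python client against one of Google's public sample tables. The project ID is a placeholder, credentials are required to actually run it, and the client library shown is simply a convenient way to issue the query, not the only one.

```python
from google.cloud import bigquery

# Placeholder project ID; requires Google Cloud credentials to run.
client = bigquery.Client(project="my-example-project")

query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

# BigQuery (Dremel underneath) scans the table and returns aggregated results interactively.
for row in client.query(query).result():
    print(row.corpus, row.total_words)
```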

Sources:
http://en.wikipedia.org/wiki/Apache_Drill

At the Intersection of Biology and Technology




As big data increases in importance, companies have started to explore new ways to use it.

Smart companies are gathering massive amounts of data and correlating it with other sources to produce new insights. This is where big data and big data analytics come in. Big data is growing into a catalyst for change on a global scale with, seemingly, limitless possibilities.

The convergence of biotechnology and bioinformatics gives companies a great advantage in gathering and analyzing data, as well as in what they can learn from that data.

MC10 and Proteus are companies that use “wearable” technologies and digestible microchips to gather and analyze information about processes like brain activity and hydration levels, which they intend to use for noble causes like lowering costs and increasing levels of care.

Sano Intelligence also plans to use wearable devices to “capture and transmit” blood chemistry information continuously to an analysis platform, capturing information directly from the human body.

Although the debate continues on how individual physiological data can be legally and ethically used, smart brains are applying new technology to reveal the information underlying the massive amount of data.


http://bits.blogs.nytimes.com/2012/09/07/big-data-in-your-blood/



Hadoop is Old News



Even though Hadoop may be all the rage right now and is expected to be the centerpiece of a billion-dollar section of the software industry within the next few years, the tech that Hadoop is founded on has already been replaced within Google. Hadoop is open source software based on two Google research papers that describe two pieces of closed-source Google software, MapReduce and the Google File System. Those papers were published almost 10 years ago, an eternity in the fast-paced technology market, and Google began phasing out those two pieces of software in favor of newer tech in 2009. Since then Google has used research papers to detail some of its newer technology. For instance, Google has detailed Caffeine, the platform that builds the index for Google Search, as well as Pregel, a graph-based database used to map complex relationships across the vast amount of information Google stores. Dremel, however, appears to be the most intriguing piece of technology Google has detailed.
Dremel essentially does what many third parties are trying to do with Hadoop: it allows SQL-like queries over massive amounts of data spread across thousands of servers, very rapidly. Google goes so far as to claim that you can run queries on petabytes of data in a matter of seconds, as opposed to the minutes or even hours it would take Hadoop to accomplish a similar feat. According to Google, Dremel can run the type of query that would take numerous MapReduce tasks in a fraction of the execution time, taking just three seconds to run a query over a petabyte of data. This is an amazing and extremely important accomplishment. With Hadoop you trade speed and responsiveness for the ability to analyze massive amounts of data, but with Dremel there would be no trade-off. Much as the open-source community spawned Hadoop after the release of the papers on MapReduce, there is already a team of engineers working on an open-source variant of Dremel aptly named OpenDremel. OpenDremel appears to be a very long way from functionality, though, and it seems less worthwhile since Google now offers BigQuery, a service that lets you use Dremel on your own data.
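For contrast with Dremel's interactive queries, here is a minimal in-memory sketch of the batch MapReduce model that Hadoop implements: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. In a real Hadoop job these run as separate distributed phases over disk, which is where the minutes-to-hours latency comes from; the documents below are made up.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate each key's values (here, a simple sum)."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is old news", "hadoop is big", "dremel queries big data fast"]
print(reduce_phase(shuffle(map_phase(docs))))   # e.g. {'big': 3, 'data': 2, ...}
```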

Sources:

https://developers.google.com/bigquery/