Wednesday, April 3, 2013

Disney and Big Data


Disney has started implementing big data techniques in its environment. The company uses a Hadoop cluster to improve information sharing and communication across Disney's departments. By collecting data from different departments, Disney can now analyze customer behavior such as theme park attendance, purchases, and viewership of Disney TV programs. This was very exciting news for Disney because the cost of introducing Hadoop was low: Disney estimated that a Hadoop project costs only $300,000 to $500,000, which is a real bargain for a company earning billions of dollars.
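The article does not describe the specific jobs Disney runs, but as a minimal sketch of the kind of aggregation such a Hadoop cluster might perform, here is a hypothetical Hadoop Streaming mapper/reducer pair in Python that totals spending per visitor. The input format and field names are assumptions made purely for illustration.

```python
# mapper.py -- a hypothetical Hadoop Streaming mapper. Assumes each input line
# looks like "visitor_id,department,amount"; the real data feeds are not public.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) != 3:
        continue                      # skip malformed records
    visitor_id, department, amount = fields
    # emit visitor_id as the key and the purchase amount as the value
    print("%s\t%s" % (visitor_id, amount))
```

```python
# reducer.py -- sums spend per visitor; Hadoop sorts the mapper output by key
# before this script sees it.
import sys

current_id, total = None, 0.0
for line in sys.stdin:
    visitor_id, amount = line.strip().split("\t")
    if visitor_id != current_id and current_id is not None:
        print("%s\t%.2f" % (current_id, total))
        total = 0.0
    current_id = visitor_id
    total += float(amount)
if current_id is not None:
    print("%s\t%.2f" % (current_id, total))
```

You can test a pair like this locally before submitting it to a cluster with something like `cat purchases.csv | python mapper.py | sort | python reducer.py`.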



And this year, Disney can track its customers' behavior even more conveniently. This spring, the MagicBand was introduced. The MagicBand is a wristband containing an RFID chip that is worn by theme park visitors. The chip is not only encoded with personal information, preferences, and credit card information, but it also tracks the wearer's behavior. Disney characters can now greet you by name when you meet them, and the band can even serve as your hotel key when you return to a hotel at the Disney theme park. Because this is so new, we still don't know whether people will like it. I also want to ask: if the band is stolen, what should a visitor do?
 





References:
1. Disney case study summarized from PricewaterhouseCoopers, Technology Forecast, Big Data Issue, 2010.
2. http://www.davidajacobs.com/2013/01/disney-goes-big-data-with-magic-band/

To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price

Retrieving and analyzing data from a flight data recorder after a typical flight is not new. Airlines often check a quick-access recorder that operates in parallel with the flight data recorder, examining certain parameters to improve operations and safety. But current tools are limited to looking for known issues, and the amount of data can be staggering. MIT professor John Hansman says the key is developing analysis tools that can effectively utilize all the information.

Commercial airlines in the United States are not required to implement a flight-data monitoring program. But the Federal Aviation Administration has a flight-operations quality-assurance program that includes guidelines airlines can follow on a voluntary basis.

Airlines typically monitor known parameters that have helped identify issues in the past. Things like engine thrust and aircraft speeds, as well as flight control positions such as elevator and rudder inputs, are among the things studied at the end of a day’s flying or when flight data is analyzed after a crash.

Professor John Hansman says that “it’s a classic data-mining problem.”

A group of researchers at the University of Washington developed a very interesting data mining technique to predict the best time to buy flight tickets. You can see the full version of the paper at the link below. Here is an interesting part that I would like to share with you.

“Corporations often use complex policies to vary product prices over time. The airline industry is one of the most sophisticated in its use of dynamic pricing strategies in an attempt to maximize its revenue. Airlines have many fare classes for seats on the same flight, use different sales channels (e.g., travel agents, priceline.com, consolidators), and frequently vary the price per seat over time based on a slew of factors including seasonality, availability of seats, competitive moves by other airlines, and more. The airlines are said to use proprietary software to compute ticket prices on any given day, but the algorithms used are jealously guarded trade secrets.”

“As product prices become increasingly available on the World Wide Web, consumers have the opportunity to become more sophisticated shoppers. They are able to comparison shop efficiently and to track prices over time; they can attempt to identify pricing patterns and rush or delay purchases based on anticipated price changes (e.g., "I'll wait to buy because they always have a big sale in the spring...").”
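The paper's actual learners are far more sophisticated, but as a minimal sketch of the buy-or-wait decision, here is a hypothetical rule in Python that buys when today's fare dips meaningfully below a trailing average. The window and threshold are made-up parameters, not anything from the paper.

```python
# A minimal, hypothetical "buy or wait" rule (not the authors' algorithm):
# buy when today's fare is noticeably below the recent average for the route.
from statistics import mean

def buy_or_wait(price_history, today_price, window=7, threshold=0.97):
    """price_history: list of daily fares observed so far (floats).
    Returns 'buy' if today's fare is meaningfully below the recent average."""
    recent = price_history[-window:] if len(price_history) >= window else price_history
    if not recent:
        return "wait"
    if today_price <= threshold * mean(recent):
        return "buy"
    return "wait"

# toy usage with made-up fares
history = [320.0, 335.0, 330.0, 340.0, 355.0, 350.0, 345.0]
print(buy_or_wait(history, 310.0))  # -> 'buy'
print(buy_or_wait(history, 352.0))  # -> 'wait'
```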



1. To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price. Oren Etzioni, Dept. of Computer Science, University of Washington, Seattle, Washington 98195. etzioni@cs.washington.edu

Utilizing Big Data with user verification

Even though everybody says that now is the age of Big Data, many companies are concerned about the glut of information and are trying to figure out how to benefit from it. One solution may be to create an environment for securely accessing data in order to provide better customer experiences. Companies can provide exclusive, authoritative data and increase their margins by decreasing fraud.

Here are some examples:

  • Self-storage facility owners cannot auction off the belongings of a customer who is past due on their payments if that customer is on active duty. They can check whether he or she is on active duty by accessing big data on military personnel.
  • By using Starfish EARLY ALERT, at-risk or low-performing students can be identified.
  • New apps like FluNearYou and Google's Flu Trends help monitor epidemics and stop them from spreading.
This kind of work has made a great deal of progress recently, but demand is still growing.

Tuesday, April 2, 2013

Big Data Ethics

Throughout this course we have all seen some amazing visualizations and great insights using big data. Big data has been around for a long time, but data can now be stored in unprecedented amounts, especially personal data. Many of the visualizations using Twitter data that we have seen have been very useful and interesting, but what is in place to make sure these studies respect privacy? There are currently very weak regulations for collecting consumer data, and privacy settings on web browsers are not legally binding. Regulation of data collection and use is a "Big Data" problem in itself. Many people are afraid that along with these insights based on personal data could come profiling and discrimination. Last year the European Union put forward a major update to its Data Protection Directive. This directive has a very broad scope in protecting personally identifiable data and holding data controllers responsible. The United States does not have anything that compares to it. So where do we go from here? Be conscious of what you do with your personal data and use good judgment when attempting to gain insight from any data that could be related back to individuals. I added some links below that touch on privacy regarding big data.

<http://searchcloudapplications.techtarget.com/feature/Big-data-collection-efforts-spark-an-information-ethics-debate>

<http://www.stanfordlawreview.org/online/privacy-paradox/big-data>


WEB DATA SOURCES FOR SPORTS (1)


In this article I will share web data sources for various major sports, for anyone who is interested in getting data from them. As we all know, with the development of internet technology there are more and more web data sources than there used to be. Many of these data sources originate from the respective sport's official governing body; however, there are also a few third-party sources that offer useful data as well. The following are the sources I have gathered:

Baseball
MLB.com
The official site of Major League Baseball contains a wealth of sortable data and a variety of colorful, easy-to-understand graphical depictions of player performance.
Retrosheet.org
Retrosheet.org is a historical game data website with complete and continuous boxscore data since 1952, textual narratives of game play for nearly every major league game of record, player transaction data, standings, umpire information, coaching records, and ejections of players and managers alike.
Baseball-reference.com
baseball1.com
Started in 1995 as a personal data collection, the Baseball Archive soon grew into an amalgam of multiple baseball data sources that can be freely queried by any user.

Basketball
NBA.com
This data source ranges from basic statistical rankings for both players and teams to more sophisticated plus/minus ratings and interactive graphics of player shooting.
Basketball-reference.com
This site attempts to be comprehensive, well-organized, and responsive to data requests. The basketball data is relatively straight-forward and easy to navigate.

Cricket
Cricinfo.com
ESPN’s cricinfo.com bills itself as the top cricket website that includes cricket news, analysis, historical data as well as real-time matchups.
Howstat.com
Howstat.com is another Cricket data repository with many features. Aside from having historical and real-time data, howstat.com also contains a superb searching and sorting application to make data requests simple and easy to use.

Football
NFL.com
The National Football League, governing body of American football, also keeps data on their official league website of NFL.com. This data is fairly standard, composed of top ranked players, player comparisons, and team statistics.
Pro-football-reference.com
Pro-football-reference.com provides ample statistics, analysis, and commentary to hold any football enthusiast's interest. Users can peruse reams of data regarding coaches, the draft, historical boxscores, and team rosters over the years, much of which is unavailable through the official league's website.
AdvancedNFLStats.com
AdvancedNFLStats.com is a more research-driven collection of football enthusiasts who share their insights and passion for the sport. While this website does not contain the usual fare of historical or real-time data, it instead focuses on sabermetric-styled creations such as game excitement rating, comeback expectancy, etc.
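If you want to move from browsing these sites to actually analyzing their numbers, here is a minimal, hypothetical Python sketch that loads a season game-log file downloaded from retrosheet.org and computes each team's home winning percentage. The file name and column positions are assumptions, so check Retrosheet's documentation for the actual layout before relying on them.

```python
# A small, hypothetical sketch of working with a downloaded game-log file
# (e.g., from retrosheet.org). The file name and column positions below are
# assumptions for illustration -- verify them against the site's docs.
import csv
from collections import defaultdict

home_wins = defaultdict(int)
home_games = defaultdict(int)

with open("GL2012.TXT", newline="") as f:       # assumed local download
    for row in csv.reader(f):
        visitor_score = int(row[9])              # assumed column positions
        home_score = int(row[10])
        home_team = row[6]
        home_games[home_team] += 1
        if home_score > visitor_score:
            home_wins[home_team] += 1

for team in sorted(home_games):
    print(team, round(home_wins[team] / home_games[team], 3))
```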


Tutorial: Python 3.3.0

If we need to use Amazon EC2 for the Big Data project, we have to know the Python programming language. The example we used in class was explained by Xinyu. Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual machines.

Python is free to use. Today I want to show you how to install the Python software, followed by a detailed tutorial.

1. Go to www.python.org


2. Choose the right installer for your OS on the download page. I chose Python 3.3.0 Windows X86-64.


3.  Follow the steps to install it.
4. Choose IDLE in your programs to start programming.

5. If you want to write programs in files, choose File >> New Window. After you finish writing, click Run >> Run Module.
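Here is a tiny first script you can type into that new window and run with Run >> Run Module to make sure everything works:

```python
# hello.py -- a first script to try in the IDLE editor window
name = input("What is your name? ")
print("Hello,", name)

# a small loop: print the squares of the numbers 1 through 5
for n in range(1, 6):
    print(n, "squared is", n * n)
```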


Now you know how to install and use Python. Next I want to point you to a detailed tutorial for programming in Python. It is a YouTube playlist: https://www.youtube.com/watch?v=4Mf0h3HphEA&list=ECEA1FEF17E1E5C0DA You might need about 10 hours to work through it, but it is very helpful.





How Netflix Recommendations Are Made



Netflix uses a wide array of Big Data techniques to generate their above average recommendations. Netflix uses machine-learning algorithms heavily, essentially before or after almost every other step, in generating recommendations. This focus is important because it raises significant issues with processing. With online processing, user interactions are responded to rapidly, but the amount of data that can be processed and the computational complexity of the processing are limited. Offline processing alleviates both of these issues, but lowers responsiveness, increasing the likelihood of data becoming outdated during processing. Nearline processing is a middle ground option that allows for online processing but is not required to occur in real time. With each of these possibilities come complex consequences and side effects. To control this, Netflix uses a combination of all three methods of processing across Amazon’s Web Services in an architecture illustrated below.


As you can see, this is an extremely complex setup. Netflix uses offline processing for calculating overarching trends or other things that require no user input, as well as machine learning to develop algorithms that can be used for result calculations. Nearline processing is used largely to develop backup plans should online processing fail to produce results as quickly as required. Nearline is also used in situations where time is of less importance than accuracy, for instance updating recommendations to show that a movie has been watched, while the user is watching the movie. Online computing is used largely in response to user activity, such as searching for a category. Netflix’s hybrid approach is particularly useful in situations where intermediate results can be batch processed and then used to calculate more specific results in real time in response to user activity. Most of Netflix’s model training and machine learning is done offline and then used online.
Netflix's hybrid approach is particularly important to big data, because it manages to create very strong recommendations, less likely to be accomplished using only online or nearline methods, while still maintaining a fast response time that would not be possible using only offline approaches.
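To make the offline/online split concrete, here is a toy sketch (my own illustration, not Netflix's actual code): a batch step precomputes item-to-item similarities from ratings, and the per-request step is just a cheap lookup. All of the data and names are made up.

```python
# Toy illustration of precomputing recommendations offline and serving them online.
from collections import defaultdict
from math import sqrt

ratings = {  # made-up data: user -> {title: rating}
    "u1": {"Drive": 5, "Heat": 4},
    "u2": {"Drive": 4, "Heat": 5, "Up": 2},
    "u3": {"Drive": 1, "Heat": 2, "Up": 5},
}

def offline_similarities(ratings):
    """Batch step: cosine similarity between every pair of titles."""
    vectors = defaultdict(dict)               # title -> {user: rating}
    for user, items in ratings.items():
        for title, r in items.items():
            vectors[title][user] = r
    sims = defaultdict(dict)
    for a in vectors:
        for b in vectors:
            if a == b:
                continue
            common = set(vectors[a]) & set(vectors[b])
            dot = sum(vectors[a][u] * vectors[b][u] for u in common)
            norm_a = sqrt(sum(v * v for v in vectors[a].values()))
            norm_b = sqrt(sum(v * v for v in vectors[b].values()))
            sims[a][b] = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return sims

SIMS = offline_similarities(ratings)   # refreshed periodically, not per request

def online_recommend(user, k=2):
    """Cheap per-request step: rank unseen titles by similarity to what the user rated."""
    seen = ratings[user]
    scores = defaultdict(float)
    for title, r in seen.items():
        for other, s in SIMS[title].items():
            if other not in seen:
                scores[other] += r * s
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(online_recommend("u1"))   # -> ['Up']
```

The expensive nested loops live entirely in the offline step, which mirrors the idea that heavy model computation happens in batch while the online layer stays fast and responsive.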

Source: http://techblog.netflix.com/2013/03/system-architectures-for.html

"Google facing fines in EVERY EU country as Information Commissioner launches probe into search giant's privacy policy"

I found this article online at Dailymail.co.uk. It looks like a really interesting article and even discusses some of the topics mentioned today in class. Take a quick read and let me know what you think in the comment box.


http://www.dailymail.co.uk/sciencetech/article-2302870/Google-facing-legal-action-EVERY-EU-country-data-goldmine-collected-users.html



Siri and Sentiment Analysis

Sentiment analysis and artificial intelligence: Siri, should I open this email?
Image from parent article


  
     Siri, Apple's iPhone/iPad "genie," lets you use your voice to send messages, schedule meetings, place phone calls, send emails, etc. According to Apple, Siri not only understands what you say, it's smart enough to know what you mean. So what if Siri were paired with sentiment analysis to determine the emotional tone of text messages, emails, or social media posts? If Siri could tell you that your message could be construed as having an unintended negative undertone, or if Siri could serve as a questioning firewall before you publish a negative post that could go viral, would it help protect you from potentially severe reputational damage? Rado Kotorov wonders whether this application of sentiment analysis would help make the world a better place with less conflict and argument, or whether it would just leave piles of unread messages in your inbox.
     So how can a machine predict the emotions in messages? This is a challenging question, given that even humans are not entirely perfect at the art of discerning other humans' emotions. Sentences can contain subtleties in various phrases across different languages, and short answers can also pose problems for sentiment analysis. Sentiment analysis typically works by "scoring" a phrase according to the number of negative or positive words it contains; there is hardly any room for it to determine the meaning behind phrases that mean the opposite of the words themselves (sarcasm).
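A bare-bones version of that word-counting approach looks something like the sketch below. The tiny word lists are invented for illustration, and the last example shows exactly where the method falls down.

```python
# Minimal lexicon-based sentiment scoring: count positive minus negative words.
# Real systems use far larger lexicons and still struggle with sarcasm.
POSITIVE = {"great", "love", "thanks", "happy", "excellent"}
NEGATIVE = {"hate", "terrible", "angry", "awful", "never"}

def sentiment_score(text):
    words = text.lower().split()
    score = (sum(w.strip(".,!?") in POSITIVE for w in words)
             - sum(w.strip(".,!?") in NEGATIVE for w in words))
    return score

def tone(text):
    s = sentiment_score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

print(tone("Thanks, I love this plan!"))                # positive
print(tone("I will never agree to this awful idea."))   # negative
print(tone("Oh sure, great idea."))                     # positive -- the sarcasm is missed
```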
     In my opinion, if this technology were to be implemented, sentiment analysis would need to advance to the point of determining nuances in colloquial language and the nature behind how humans communicate in order to be effective. Parent article here.

Facebook and Data mining

If you haven’t fled Facebook for Google+ or abandoned social networks entirely, you probably–like me–have a lot invested in the platform. A new feature is in beta on Facebook: Graph Search. If you get through the waiting list to try it out, you’ll find lots of options for targeted searches centered on your social network. Graph search works by linking together terms and restrictions to allow for very specific searches within the network: you can look for images from friends based on a common location or subject, or find everyone in your social network who went to the same university and are fans of Glee. Is it useful? The possibilities for networking–from finding local friends who share a passion for running to gathering info on a potential new campus to making connections at a company–are immediately clear. But it’s also a powerful (and perhaps alarming) data mining tool that puts front and center just how much data some of us have committed to this social network already.
Those with access to the new search mechanism have already created a stir with sites such as Tom Scott’s “Actual Facebook Graph Searches,” which includes several juxtapositions of targeted search queries that could reveal everything from personally embarrassing information to illegal acts within certain countries. Of course, targeted Google searches or just a quick browse of an ill-considered profile can be equally as revealing, but there’s an alarming efficiency to this new method of data-mining within the social network. The availability of all this data is definitely going to lead to some tense Institutional Review Board debates, as it offers an easy way for all of us to see some of the incredible marketing and interest data that Facebook has been amassing on its users. It could certainly be a fertile ground for social research–but are all Facebook users really clear on how much information they’re sharing?
The introduction of Graph Search makes this an important time to revisit privacy settings: EFF has broken down some of the new implications. Check out Brian’s essential steps to checking Facebook privacy to get started. Searches through images can be particularly hard to control, as it pulls images from everyone’s albums and your friends might not have the same standards for privacy as you do. This next iteration of social search is also another opportunity to talk with students about their digital identities and privacy choices. I’ll definitely be taking a look at its ramifications in my digital communication learning community course this semester, as it shows how easy it is to pull personal information out of the noise of social media.



Monday, April 1, 2013

Big Data vs. Big Applications of Small Data


Big data is awesome and has done a lot of great things for businesses, as we have seen in many of the previous blog posts. We have become obsessed with big data and the idea that it will translate into profitable businesses. One of the reasons big data has become so popular is that it promises to bring a new age of science to decision-making and business reasoning.

Despite all of these promises, it is still unclear how to move from vision to reality. However, there is another type of data that is easier to deal with but can still give a lot of useful information: small data applied in big ways. Two different businesses have capitalized on this idea and have profited greatly from it. One is Burberry, a luxury brand, and the other is Caesars, which holds stakes in hotels, spas, and casinos.

Burberry started using data in its business simply by creating a better experience for its customers. It put in screens to connect shops to the head office, installed audio-visual customer information screens, and gave the staff iPads. This led the company to use social media to engage its audience and to cross-reference the information gathered from in-store purchases with the information on the staff's iPads. This is all considered small data, but it is used in a big way.

Caesars Entertainment started with big data analytics long before Burberry started using small data. However, they were unsatisfied with the lack of small details in the big data they had previously acquired. They decided to revamp the customer experience and followed customers from their initial search until they appeared on Caesars property. Caesars has expanded that so that customers on a losing streak receive a personalized voucher. The company has learned when it is best to give credit to a customer and when it is best not to do so.

Burberry and Caesars, coming from different ends of the data spectrum, both ended up in the same place: focusing on small data in order to make sure the customer comes first. This has led to profit increases for both companies and Burberry became the fastest-growing luxury brand in 2012. So even though this class is focused on big-data, we must not forget that small data applied in a big way can be just as valuable. 

Source:
http://www.forbes.com/fdc/welcome_mjx.shtml

Big Data's biggest challenge? Video Games..

I'll start with a few facts in order to convince you of the enormity of information held in gaming.

1) Over the last decade, the video game industry has grown from 200 million consumers to 2 billion consumers.
2) "Battlefield 3": 1 TB of data is generated daily (this data includes kills, deaths, shots, explosions, etc.).
3) Simpler games such as "The Simpsons: Tapped Out" generate 150 GB daily and 4.5 TB monthly.
4) In a typical month, over two and a half billion sessions are held across the entire platform of games (roughly fifty billion minutes of game play).

The reason the gaming industry finds big data so interesting is its evolution from paid-to-play games to free-to-play games. Many gaming companies are changing their business model from a one-time upfront payment to free-to-play, in the hope that you will spend money on in-game help (i.e., better weapons, new characters, different maps, etc.). They find that the relationships between prices and purchases, or between limitations and purchases, are vital to maximizing their profits. Rajat Taneja states that "the right data at the right time is more valuable than bigger data sets." Since there is so much data being generated daily, descriptive data from one day may be more telling than descriptive data from the whole week. By looking at only a small data set, though, you may overlook an opportunity to enhance the consumer's gaming experience.

Reference: http://www.youtube.com/watch?v=ZK_PXlbvOfM

Music and Big Data

Data is everywhere, and most of the time we don't even realize it. Digital stores like iTunes can actually provide analysts in the music industry with insights into listening habits, preferences, and a whole host of other data that could be used to understand the music consumer better. Information like this is being used to drive the type of songs and the type of branding that goes into channels such as online radio.
For starters, there is a company called The Echo Nest that has spent over half a decade analyzing over 30 million songs spread across 2 million artists. This information is being used to populate data points that are driving a whole new era of music, dubbed 'fanalytics'. Companies like The Echo Nest are even analyzing the tempo, pitch, and other aspects of songs and relating them to cultural habits and other behavioral and consumption data to help better understand what sells and what does not.
Like all other industries before it, the music industry has turned toward data analytics to maximize its avenues of profit. The technology being used as a 'recommendation engine' for music is very similar to what one sees when buying books from Amazon.com: recommendations are made based on similar books or similar customer profiles, giving the buyer more avenues to explore.
One of the ways Big Data tools can help record labels understand their local customers better is with regard to live concerts. Big Data tools can help organizers understand which cities to do shows in, which artists to promote, what songs are trending in those cities, what avenues to use to promote concerts, and much more. This may sound like Big Data is just helping record labels fill in their yearly planners better, but if done correctly it can help artists know their audiences better, plan their shows better, and ultimately make more money. Lady Gaga's gross profit from concerts alone was over $225 million in 2011. Her manager admits to being a big fan of Big Data, as it is helping them plan the stage and set list based on the mood of the city they have concerts in, ensuring a better turnout and response. Music artists are already using 'fanalytics' to gain the next level of understanding of their listeners. This information is already being used in a whole range of activities including merchandising, concert planning, and even songwriting.
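As a rough illustration of the feature-based matching described above (not The Echo Nest's actual engine), here is a toy sketch that reduces each song to a couple of made-up audio features and recommends the closest catalog entries.

```python
# Toy feature-based song matching: recommend the catalog entries whose audio
# features sit closest to a song the listener already likes. All values invented.
from math import sqrt

catalog = {
    "Song A": {"tempo": 128, "energy": 0.9},
    "Song B": {"tempo": 70,  "energy": 0.3},
    "Song C": {"tempo": 124, "energy": 0.8},
}

def distance(f1, f2):
    # scale tempo down so it doesn't swamp the 0-1 energy feature
    return sqrt(((f1["tempo"] - f2["tempo"]) / 100.0) ** 2 +
                (f1["energy"] - f2["energy"]) ** 2)

def similar_to(liked_song, k=1):
    liked = catalog[liked_song]
    others = [(distance(liked, feats), name)
              for name, feats in catalog.items() if name != liked_song]
    return [name for _, name in sorted(others)[:k]]

print(similar_to("Song A"))  # -> ['Song C']
```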

Sources: http://venturebeat.com/2012/02/17/music-hack-day/
http://www.webpronews.com/the-echo-nest-where-big-data-is-taking-the-music-industry-2012-04

How businesses, criminals, and governments track what you're doing

        There have been a lot of cases in the news about identity theft, which has made a lot of people wary of criminal activity. What people aren't realizing is that there is also a bigger privacy issue growing, not with criminals, but with companies like Google, Amazon, and Facebook. At the moment, there aren't a lot of laws and regulations that control what data these businesses can collect. Predictive analytics on big data is a tool that is giving governments and businesses the capability of being even more intrusive.

        The data being used can be gathered using point-of-sale systems, mobile devices, cameras, microphones, internet searches, and online tracking technologies. One example is the detailed transactions that are saved by retailers such as Wal-Mart and Target. Another example is "likes" and "shares" on Facebook. Even your searches in Google are saved. Statistics, modeling, and data mining are only some of the tools being used to analyze the huge amount of data that people in the US give off every single day. Targeting customers is the primary objective of this analysis.

      Target is well known for targeting customers, especially new moms; a story about this was discussed in an earlier blog post. Another application is dynamic pricing. Prices can be changed based on an algorithm that estimates customers' willingness to pay. This is especially useful and accurate for online retailers, who have more personal information, purchasing information, and a record of how many times a person has looked at a product. This gives retailers a much better idea of how much customers are willing to pay for certain items.
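No retailer publishes its formula, but a simplified, hypothetical version of willingness-to-pay pricing might look like the sketch below, where more signals of interest translate into a smaller discount. The weights are entirely made up.

```python
# A hypothetical illustration of dynamic pricing: the more signals suggest a
# shopper really wants the item (repeat views, past purchases in the category),
# the smaller the discount offered. Weights are invented for illustration.
def quoted_price(list_price, times_viewed, bought_in_category_before):
    # crude proxy for willingness to pay, clamped between 0 and 1
    interest = min(1.0, 0.2 * times_viewed + (0.3 if bought_in_category_before else 0.0))
    max_discount = 0.15  # never discount more than 15% in this toy model
    discount = max_discount * (1.0 - interest)
    return round(list_price * (1.0 - discount), 2)

print(quoted_price(100.0, times_viewed=1, bought_in_category_before=False))  # bigger discount
print(quoted_price(100.0, times_viewed=5, bought_in_category_before=True))   # close to list price
```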

     Some examples of patterns and correlations discovered by big data are:

  • Facebook "likes" revealing political and religious views, drug use, marital status, and sexual orientation
  • Blue Cross/Blue Shield buying shopping data: if a person buys plus-size clothing, the plan could flag them for potential obesity and, in turn, higher healthcare costs
  • President Obama's 2012 campaign using datasets to identify Republican-leaning voters who might be persuaded by specific issues

What makes all of this so interesting is that there aren't very many regulations, so who knows where this could go if governments and businesses continue to be unregulated in what they can and cannot collect.

The potential career link between big data and Wall Street




It has long been an open secret that Wall Street will hire from fields like rocket science, looking for skills in modeling, advanced mathematics, and analysis. But did you know that there is now a job title Wall Street uses to describe these data- and mathematics-driven analysts, and that it is directly related to big data? They're called quants, which is short for quantitative analyst.

I thought that since we are getting close to the end of the semester, I would share about this field, as it uses big data in a very financially rewarding career. It turns out that most large Wall Street investment firms employ quants as analysts who build models for understanding markets, securities, and instruments based on large data sets of past market performance. These analysts have become so integral to Wall Street that graduate degree programs have been created specializing in training quants. Here are the links to two of these programs: the first is at Southern New Hampshire University (http://www.snhu.edu/online-degrees/graduate-degrees/MBA-online/quantitative-analysis.asp) and the second is at UC Berkeley (http://extension.berkeley.edu/spos/quantitative.html).

If you read the Wikipedia page that describes quants, it has marked similarities to many of the things that describe Industrial Engineers, only it focuses solely on using these techniques to build predictive and descriptive models involving different types of finance. Quants use tools like Monte Carlo simulations, stochastic modeling, and time series analysis to inform investors and portfolio managers about different areas of finance.
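To give a flavor of the kind of exercise a quant might run, here is a small Monte Carlo sketch in Python that simulates many possible one-year paths for a portfolio under an assumed (and entirely made-up) daily return distribution and reads off a rough worst-case percentile.

```python
# Monte Carlo sketch: simulate one-year portfolio paths assuming normally
# distributed daily returns. The return and volatility figures are invented.
import random
import statistics

def simulate_final_values(start_value=1000000, days=252,
                          daily_mean=0.0003, daily_vol=0.01, trials=10000):
    finals = []
    for _ in range(trials):
        value = start_value
        for _ in range(days):
            value *= 1.0 + random.gauss(daily_mean, daily_vol)
        finals.append(value)
    return finals

finals = simulate_final_values()
finals.sort()
var_5pct = finals[int(0.05 * len(finals))]  # 5th-percentile outcome
print("median outcome: %.0f" % statistics.median(finals))
print("5%% worst-case outcome (a rough value-at-risk level): %.0f" % var_5pct)
```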

Gavin posted last week about High Frequency Trading. The foundation of High Frequency Trading is models and algorithms made by Quants. I know that there are undergraduates in the class that are considering an MBA and taking their IE knowledge to the business world, so I thought that I would share this in case anybody was interested in a career using data mining in business. 

Data mining could predict heart attack risks!


A team of researchers has used data mining techniques to find subtle changes in electrical activity in the heart that can be used to predict potentially fatal heart attacks.

Researchers from the University of Michigan, MIT, Harvard Medical School, and Brigham and Women's Hospital in Boston sifted through 24-hour electrocardiograms (which measure the electrical activity in the heart) from 4,557 heart-attack patients to find errant patterns that until now had been dismissed as noise or were undetectable.
They discovered several of these subtle markers of heart damage that could help doctors identify which heart attack patients are at a high risk of dying soon. Electrocardiograms (ECGs) are already used to monitor heart attack patients, but doctors tend to look at the data in snapshots rather than analyze the lengthy recordings.
The team developed ways to scan huge volumes of data to find slight abnormalities — computational biomarkers — that indicate defects in the heart muscle and nervous system. These included looking for subtle variability in the shape of apparently normal-looking heartbeats over time; specific sequences of changes in heart rate; and a comparison of a patient’s long-term ECG signal with those of other patients with similar histories.
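To make the first of those ideas concrete, here is a highly simplified sketch (my own illustration, not the researchers' actual method) of scoring how much the shape of consecutive beats drifts over time.

```python
# Simplified "morphologic variability" score: real ECG beats are sampled
# waveforms; here each beat is just a short list of numbers, and the score is
# the average distance between consecutive beats. All values are made up.
from math import sqrt

def beat_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def morphologic_variability(beats):
    """beats: list of equal-length beat waveforms (lists of floats)."""
    if len(beats) < 2:
        return 0.0
    dists = [beat_distance(beats[i], beats[i + 1]) for i in range(len(beats) - 1)]
    return sum(dists) / len(dists)

stable_beats   = [[0.0, 1.0, 0.2], [0.0, 1.0, 0.2], [0.1, 1.0, 0.2]]
drifting_beats = [[0.0, 1.0, 0.2], [0.1, 0.8, 0.3], [0.3, 0.6, 0.5]]
print(morphologic_variability(stable_beats))    # small number
print(morphologic_variability(drifting_beats))  # larger number
```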
They found that looking for these particular biomarkers in addition to using the traditional assessment tools helped to predict 50 percent more deaths. The best thing is that the data is already routinely collected, so implementing the system would not be costly.
Around a million Americans have heart attacks each year, and more than a quarter of those who survive the initial attack die within a year. Current techniques miss around 70 percent of the patients who are at high risk of complications, according to Zeeshan Syed, assistant professor in the University of Michigan Department of Electrical Engineering.

Syed explains: “There’s information buried in the noise, and it’s almost invisible because of the sheer volume of the data. But by using sophisticated computational techniques, we can separate what is truly noise from what is actually abnormal behavior that tells us how unstable the heart is.”
Doctors tend to look out for several factors in heart attack patients, including blood test results, echocardiograms, medical history and the patient’s overall health. Those identified as having a high risk of sudden death due to irregular heart rhythms can be given medication or implantable defibrillators, which can shock the heart back into its regular rhythm.
However, it’s hard to work out who needs these treatments before it’s too late — most people who die in this manner aren’t identified as candidates for implantable defibrillators.
MIT professor John Guttag explains: “We’re reaching a point in medicine where our ability to collect data has far outstripped our ability to analyze or digest it. You can’t ask a physician to look at 72 hours’ worth of ECG data, so people have focused on the things you can learn by looking at tiny pieces of it.”
Reference: