Analytics and Visualization of Big Data

Monday, April 29, 2013

Big Data Analytics of Eventual Super Bowl Winning teams

Jessica Clemmensen and I, Wilson Lyle, did our final project on big data analytics of eventual Super Bowl champions.

Here is the link to the video tutorial we recorded: http://www.youtube.com/watch?v=jX9K82aE4so

Below is the paper. If anyone is interested in reading it as a PDF as opposed to a blog post, contact Jessica or me and we would be more than happy to send it to you. Enjoy.

Wilson Lyle: WWL0002@auburn.edu
Jessica Clemmensen: JLC0030@auburn.edu

Conference Paper

Big Data Analytics

Football Statistics Super Bowl Winning Teams

This paper was written as a summary for a project that was completed for INSY 4970 at Auburn University.

Unless otherwise specifically stated, the information contained herein is made available to the public by Jessica Clemmensen, Wilson Lyle and Auburn University for reference or non-profit use. The intent of this paper was to further understand the sport of football and how success can be attained.

Neither Jessica Clemmensen, Wilson Lyle, Auburn University nor any other agency or entities thereof, assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, product or process disclosed in this paper.

Neither Jessica Clemmensen, Wilson Lyle, Auburn University nor any other agency or entities thereof, assumes any liability or responsibility for money or any other assets lost due to gambling based on the conclusions stated within the paper; this is a purely academic paper and should not be used for profit. Any other use is prohibited.

Reference herein to any specific commercial product, process, service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement, recommendation, or favoring by Jessica Clemmensen, Wilson Lyle, Auburn University or any entities thereof.

The views and opinions expressed in this paper do not necessarily state or reflect those of Auburn University or the Auburn University Industrial and Systems Engineering department.

Executive Summary

At the beginning of the semester in INSY 4970, the project group decided to develop a big data, data mining project that would seek to develop a model that would accurately predict Super Bowl winners. In order to complete the project, the group downloaded statistics from ESPN’s website which lists statistics for all teams beginning in 2002 for regular and playoff seasons.

In order to narrow the scope of the project, a set of hypotheses and research questions was used to explore data within the regular and playoff seasons. While these questions focused on the statistical data from ESPN, miscellaneous circumstances were also considered, such as emotional effects on player performance and circumstances surrounding the Super Bowl. Microsoft Excel, Minitab, and Orange Canvas software packages were used during data analysis.

The project group analyzed a variety of statistics, looking for trends within the data that were unique to Super Bowl winning teams. The team analyzed the effects of first round byes, compared playoff performance to regular season performance, and attempted to define the concept of an “elite” quarterback during the playoff season. Regular season analysis included identifying characteristics correlated with eventual winners, defining the term “hot” when referring to teams, examining offensive and defensive strategies, and establishing conference difficulty.

After data analysis was completed, the project team determined that no complete model would accurately predict the outcome of the Super Bowl for all teams over the last 11 seasons. However, some characteristics were common to all Super Bowl winning teams. These include a balanced offense and a balanced game strategy: championship teams tend not to rely on either offense or defense, yet both units are above average. Quarterbacks tend to improve between the regular and playoff seasons. Conference difficulty, historically, does not indicate which team will win the Super Bowl. Though emotional influence cannot be quantified, the project group believes it was a motivating factor for previous championship teams.

Introduction

The Super Bowl can be called the television event of the year, garnering a staggering 109 million viewers in 2013 alone. Viewers tune in to watch their favorite team try to capture the Lombardi Trophy, placing bets and yelling at the television. Americans across the country end their Sunday evenings at least once a year happy or miserable, depending on whether or not the final score favored their wishes.

The score may indicate the outcome of the game, but what factors contribute to a team’s win? This question has many possible answers, and it requires an in-depth analysis of statistics for the winning teams of past Super Bowl games. If one could accurately predict the winner of the Super Bowl based on statistics, one could earn not only bragging rights but also financial gain. In 2012, Super Bowl viewers wagered an estimated $93,899,840 on various bets. Predicting the winner of the Super Bowl with confidence could finance a college education, buy a new car or a trip abroad, and with so many people viewing and actively invested in the game, there is a plethora of data to analyze.

The question the project team sought to answer is: are there certain patterns or trends that eventual Super Bowl winning teams show? In order to provide a solution to this question, it is necessary to develop an initial starting point. As a team, it was decided that the best approach would be to group questions by regular season and playoff season performance. The questions are listed below.

During the playoffs:

1. How many eventual winners had first round byes?

2. Does the way the teams perform during the regular season line up with how they played during the playoffs or do many teams just go on an unprecedented Championship run?

3. Most experts would say that you must have an “Elite QB” to win a Super Bowl (10 of last 12). What makes a QB Elite? Is this a catch-22? Many QBs aren’t actually considered elite until after they win a Super Bowl. This raises the question of how you determine if a QB is on his way to becoming elite.

During the regular season:

4. Are there specific stat categories that eventual champions excel at (PPG, OPPG, YPG, 4th down conversion percentage etc.)?

5. Many teams enter the playoffs “Hot” (Giants 2012, Packers 2011). Do most eventual champions end the regular season “Hot”? How do you consider a team “Hot”? The last 8 games? 4 games?

6. Do most eventual champions run similar offenses? Pass/Run/Balanced?

7. Do most eventual champions rely on their Defense? Offense? Neither?

8. Running back-by-committee has become increasingly popular in the last 10 years. Do most eventual champions rely on a single back or multiple?

9. Does the eventual champion come from the “easier” conference?

The three italicized questions are believed to provide the most insight to the question at hand. The group also considered miscellaneous circumstances surrounding the Super Bowl such as Ray Lewis’ impending retirement this year and the effect of Hurricane Katrina in the 2010 match up.

Method

The group retrieved all data except AFC and NFC records from ESPN’s website (http://espn.go.com/nfl/statistcs). AFC and NFC records were downloaded from a visualization website (http://visual.ly/afc-vs-nfc-records-year?view=true). The data was compiled using Visual Basic; however, some data had to be entered manually in Excel. The manually entered data included byes, running back-by-committee, and data associated with miscellaneous circumstances. Although data collection combined manual and coded input, all of it came from ESPN.com except for the aforementioned conference records.

For data that group members did not enter manually, Visual Basic macros were written to compile and manipulate the data. The macros created workbooks that housed statistical data for seven categories: defense, downs, passing, receiving, returning, rushing and total. Each category’s data was stored in three workbooks. The first held the regular season statistics for every team. The second held the playoff statistics for every team that qualified. The third held the opposing team statistics for each team. Unfortunately, the group could not find a source for playoff statistics for opposing teams, so the opposing team statistics were limited to the regular season, and the group consequently did not analyze them.

All of the workbooks used essentially the same macro, which consisted of a single “for” loop. On each iteration, the macro created a new worksheet named for the year and inserted that year’s data from ESPN.com. It then deleted the unnecessary rows before and after the data; these rows contained links to other parts of the ESPN website that did not pertain to the project. The only difference between workbooks was the URL, because each workbook downloaded a different statistic.
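The per-year loop described above can be sketched in Python rather than Visual Basic. This is only an illustration under stated assumptions, not the group's macro: `fetch_rows` stands in for the download step, and the marker strings "RK" and "Glossary" used to find the start and end of the data are hypothetical, as are the placeholder rows.

```python
from typing import Callable, Dict, Iterable, List

Row = List[str]


def trim_rows(rows: List[Row], header_marker: str, footer_marker: str) -> List[Row]:
    """Keep the header row and the data rows below it.

    Rows before the header and rows from the footer marker onward
    (navigation links and similar page residue) are dropped.
    """
    start = next(i for i, row in enumerate(rows) if row and row[0] == header_marker)
    end = next(
        (i for i, row in enumerate(rows) if i > start and row and row[0] == footer_marker),
        len(rows),
    )
    return rows[start:end]


def build_workbook(
    years: Iterable[int],
    fetch_rows: Callable[[int], List[Row]],
    header_marker: str = "RK",        # hypothetical first cell of the header row
    footer_marker: str = "Glossary",  # hypothetical first cell after the data
) -> Dict[str, List[Row]]:
    """One 'worksheet' per year, named for the year, holding the trimmed table."""
    workbook: Dict[str, List[Row]] = {}
    for year in years:
        workbook[str(year)] = trim_rows(fetch_rows(year), header_marker, footer_marker)
    return workbook


def fake_fetch(year: int) -> List[Row]:
    """Placeholder for the download step; returns made-up rows."""
    return [
        ["NFL Home"],
        ["Scores"],
        ["RK", "TEAM", "YDS"],
        ["1", "Team A", "400"],
        ["2", "Team B", "350"],
        ["Glossary"],
        ["Terms of Use"],
    ]


workbook = build_workbook(range(2002, 2013), fake_fetch)
print(len(workbook), "worksheets;", len(workbook["2002"]), "rows kept for 2002")
```

Swapping in a different `fetch_rows` per statistic mirrors the paper's remark that only the URL changed between workbooks.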

The next step was to compile all of the data into a single spreadsheet. This was accomplished through a two-step process. The first step involved creating a macro that assembled all regular and post-season statistics of Super Bowl winning teams into a single workbook, with each worksheet representing a year. The second step did not require a macro: the group manually copied and pasted the 11 worksheets into a single worksheet in another workbook.

The group analyzed the data in a variety of ways. A number of conclusions could be made by simply looking at the data once it had been arranged in a single workbook. Other data needed formatting in Excel; however, the majority of the project group’s conclusions were drawn through data analysis in the statistical software packages Minitab and Orange Canvas. Determining whether a bye affected a team’s chance of winning the championship was as simple as looking at the spreadsheet and seeing that only five of the last 11 champions had first round byes. The running back-by-committee and miscellaneous analyses were similar to the bye analysis. Other categories, namely offensive strategy, team strength and conference difficulty, used Excel to draw conclusions. They only required Excel to calculate averages, medians, probabilities and standard deviations. Two categories, the “meat” of the project, were analyzed using statistical software. Minitab was used to compare regular season and playoff performances through a series of t-tests. When analyzing championship characteristics, the group used Orange Canvas to create scatter plots for each statistic downloaded. The group plotted each statistic for all teams, coloring championship teams blue instead of red so that Super Bowl winning teams were prominent in the scatter plots. Trends for these statistics were then noted and analyzed.

Results & Conclusions

Byes

The project group looked to answer the nine questions outlined above. The initial question was established in order to look for correlations between eventual Super Bowl winning teams and whether or not these championship teams had a first round bye. A first round bye is awarded to a team that wins its division and finishes first or second in its conference. This means that the team does not have a game the first week of the playoffs. It also guarantees that the team plays its first playoff game in its home stadium. Initially, the project group hypothesized that the majority of eventual Super Bowl winners would have first round byes. It was believed that a team that played well enough in the regular season to earn a first round bye would be qualified to win the Super Bowl as well. It was also believed that a home game to open the playoffs would help a team win that game, and that beginning the playoffs with a win would help a team gather momentum and an emotional edge. The group’s predictions proved to be inaccurate. Only five of the 11 Super Bowl winning teams evaluated from 2002 to 2012 had a first round bye. The Saints, Steelers, Patriots (2), and Buccaneers had first round byes, while the Colts, Giants (2), Steelers, Packers, and Ravens did not. The Steelers had a first round bye in 2008 but not in 2005, which indicates that the bye was not necessary for the Steelers to win the Super Bowl. The Patriots had first round byes for all four of their Super Bowl appearances in the last ten years; however, they won only two of the four games. Consequently, it cannot be determined that the first round byes were the reason for the Patriots’ successful Super Bowl appearances. Overall, championship teams had first round byes less than half of the time over the seasons analyzed. Because more champions went without a first round bye than had one, the project group concluded that first round byes do not appear to affect the outcome of the Super Bowl.
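The bye tally above can be reproduced with a few lines of Python. The bye flags come from the paragraph; the pairing of teams with seasons follows the 2002 to 2012 seasons the paper covers, and the variable names are chosen here for illustration.

```python
# (season, champion, had_first_round_bye), per the tally in the text
champions = [
    (2002, "Buccaneers", True),
    (2003, "Patriots", True),
    (2004, "Patriots", True),
    (2005, "Steelers", False),
    (2006, "Colts", False),
    (2007, "Giants", False),
    (2008, "Steelers", True),
    (2009, "Saints", True),
    (2010, "Packers", False),
    (2011, "Giants", False),
    (2012, "Ravens", False),
]

bye_count = sum(1 for _, _, had_bye in champions if had_bye)
bye_share = bye_count / len(champions)

print(f"{bye_count} of {len(champions)} champions had a first round bye ({bye_share:.1%})")
```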

Playoff vs. Regular Season Performance

The project group also looked to evaluate whether or not a team’s regular season performance was indicative of its playoff performance, noting that a team could achieve unprecedented success in the playoffs through a championship drive not seen in the regular season. The group analyzed regular and playoff season performances by subgrouping data into two categories: offense and defense. Offensive performance was broken down into passing, rushing, and down conversions. Defensive performance was evaluated by tracking the number of tackles and interceptions.

Quarterback Rating

Because the quarterback is the heartbeat of the passing game, the quarterback rating was hypothesized to increase after the regular season. The quarterback rating is a formula established by the NFL that essentially looks to assign a value to a team’s quarterback. The formula combines components based on completions per attempt, touchdowns per attempt, interceptions per attempt, and yards per attempt. Dean Oliver of ESPN writes that the QBR was established in order to evaluate a quarterback’s expected number of points and probability of winning. The average quarterback rating for Super Bowl winning teams was 90.2 during the regular season and increased to 98.5 in the playoffs. A t-test was performed in order to determine whether the average quarterback ratings for the regular and playoff seasons were the same. The p-value was 0.15, which does not reach the conventional 0.05 significance level; nevertheless, the group reasoned that, on average, quarterback ratings for eventual Super Bowl winning teams do not remain the same between the regular and playoff seasons. Based on historical data, quarterbacks for champions outperform their regular season appearances.
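The four per-attempt components named above are the inputs to the traditional NFL passer rating, which is a separate measure from ESPN’s Total QBR. A minimal sketch of that passer rating formula follows; the function and parameter names are chosen here for illustration, and the sample stat line is made up.

```python
def passer_rating(completions: int, attempts: int, yards: int,
                  touchdowns: int, interceptions: int) -> float:
    """Traditional NFL passer rating on a 0 to 158.3 scale."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    def clamp(value: float) -> float:
        # Each component is limited to the range [0, 2.375].
        return max(0.0, min(value, 2.375))

    a = clamp((completions / attempts - 0.3) * 5)        # completion percentage
    b = clamp((yards / attempts - 3) * 0.25)             # yards per attempt
    c = clamp((touchdowns / attempts) * 20)              # touchdowns per attempt
    d = clamp(2.375 - (interceptions / attempts) * 25)   # interceptions per attempt
    return (a + b + c + d) / 6 * 100


# Made-up stat line: 20 of 30 for 250 yards, 2 touchdowns, 1 interception.
print(round(passer_rating(20, 30, 250, 2, 1), 1))
```

Because each component is capped, a single outstanding category cannot push the rating past the 158.3 maximum.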

Interceptions: Offense

Interceptions are also an indication of success in the passing game. A low value for interceptions per attempt indicates that the quarterback is accurate and precise in executing passing plays. Interceptions per attempt averaged 0.025 during the regular season over the seasons analyzed and decreased to 0.014 during the playoffs. Again, a t-test was performed, which yielded a p-value of 0.047. This is significant at the 0.05 level: if the two averages were truly equal, a difference this large would be expected less than 5% of the time. Interceptions per attempt is on average lower in the playoffs. This was expected, as teams cannot make mistakes such as interceptions and still win games. Reducing interceptions between the regular season and the playoffs is a strong characteristic of Super Bowl winners.

Third Down Conversion Percentage

Third down conversions give offenses more time on the field and demonstrate an offense’s ability to keep moving down the field toward the opponent’s end zone. The t-test on third down conversion percentage produced a p-value of 0.112. Again, the group took this to indicate that the average third down conversion percentages for the regular and playoff seasons are not the same, although the result does not reach the 0.05 significance level. Conversion percentages average 41.14 in the regular season and 44.87 in the playoffs. Based on historical data, conversion percentages increase in the playoffs, which is logical in that teams must sustain drives against stronger opponents once less capable teams are eliminated.

These t-tests produced logical p-values. The statistical tests essentially demonstrate that quarterbacks of eventual championship teams play better in the playoffs than in the regular season, which is to be expected. It also makes sense that interceptions per attempt decrease at a more significant level than quarterback rating increases, because winning teams do not turn the ball over.
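The comparisons above were run in Minitab, and the paper does not state whether the tests were paired or two-sample. As a rough sketch, assuming an unequal-variance (Welch) two-sample test, the t statistic and degrees of freedom can be computed with the Python standard library. The two samples below are made-up placeholders, not the group’s data.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(sample_a: Sequence[float], sample_b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances
    se_sq = var_a / n_a + var_b / n_b                       # squared standard error
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_sq)
    dof = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Placeholder quarterback ratings (illustrative only).
regular_season = [88.0, 92.5, 85.1, 95.3, 90.0]
playoffs = [97.2, 101.4, 93.8, 104.9, 96.0]

t_stat, dof = welch_t(regular_season, playoffs)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

The p-value then comes from the t distribution with those degrees of freedom, which a statistics package such as Minitab reports directly.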

Tackles & Defensive Interceptions

T-tests were also performed on defensive statistics. Teams averaged 62.92 tackles per game in the regular season and 63.28 in the playoffs. The p-value testing the equivalence of playoff and regular season average tackles is 0.015. This p-value suggests that the hypothesis that teams make the same number of tackles in the playoff and regular seasons is not true. The project group found this result surprising, as the average values for playoff and regular season tackles both round to 63 per game. Defenses averaged 1.295 interceptions per game in the regular season and 1.659 in the playoffs. The p-value for this test is 0.128, so the group held that defenses average a different number of interceptions per game in each season, historically increasing during the playoffs.

Elite Quarterbacks

Super Bowl winning teams have to be able to score, which requires a strong offense. The quarterback captains the offense. Quarterbacks are called “elite” when they are likely to be inducted into the National Football League’s Hall of Fame. Since 2002, nine of the 11 winning teams had quarterbacks who are now considered elite. However, a definition or set of clear qualities that identified an elite quarterback before he wins would help indicate that a team is likely to appear in and win the Super Bowl. The group initially hoped to provide a definition based on offensive statistics, though this question remains unanswered. Eventual Super Bowl winning teams all have quarterbacks with a high QBR and high passing yards per attempt. These were the only qualities consistent across all championship teams analyzed; however, these alone do not classify a quarterback as elite, as many non-elite quarterbacks meet these criteria. In conclusion, Super Bowl winning teams had elite offensive leaders 82% of the time since 2002. Elite quarterbacks appear in the Super Bowl the majority of the time, yet there is no clear definition that indicates whether a current quarterback will be considered elite in the future.

Championship Characteristics

Regular season performance determines whether or not a team is eligible for the playoffs, which then determine the teams facing off for the Lombardi Trophy. The project group sought to find characteristics of championship teams that were prominent during the regular season. Using scatter plots in Orange Canvas, the group looked at each statistic for these teams from 2002 to the present. Of all the data, four statistics were consistent across all championship teams, with these teams predominantly clustered at an above average level: quarterback rating, passing yards per attempt, rushing attempts per game, and sacks per game. This is logical. A high quarterback rating indicates a high probability of winning games as well as a high number of expected points scored. This demonstrates a qualified quarterback leading the offense, which was characteristic of winning teams. High passing yards per attempt and rushing attempts per game indicate that champions run the ball at a high rate and then capitalize on their passing attempts. Sacks per game illustrates defensive strength, which prevents opponents from outscoring Super Bowl winning teams; eventual winners are able to penetrate the opposing offensive line and apply pressure on the opposing quarterback and offense in general.

Classification: “Hot”

Sports analysts continually discuss whether a team is “hot” entering the playoffs and whether or not this is a deciding factor in the Super Bowl. The group set out to determine whether data can be used to identify which teams are “hot” and whether such teams have a chance at winning the Super Bowl. The group looked at how many eventual champions won their last game of the regular season, their record over the last five games of the regular season, and their record over the last eight games of the regular season. Surprisingly, there was no consistency for any of the three. The group initially predicted that eventual champions would consistently win games, cementing their team strengths before playing in the Super Bowl. The Ravens, Super Bowl winners in 2012, went 1-4 in their last five games, including a loss in the final game of the year. Only one champion in the last seven seasons won at least four of its last five games: the Steelers, who went 4-1. None of the last seven Super Bowl champions won all of their final five regular season games. Although the Ravens lost their last game of the regular season, many eventual champions do in fact win their last game. The Giants (2011) and the Packers (2010) had to win their last two games of the season in order to be eligible for the playoffs. After looking at this data, the project group determined that what matters is how a team is playing upon entering the playoffs and not whether it won its last five games. A great example was the Giants in the 2007 season. The Giants lost their last game of the season. However, they lost to the New England Patriots who, by winning that game, finished the regular season undefeated. The Giants played well, losing by only three points, which made them the team that came closest to defeating the Patriots. The Giants rode this momentum through the playoffs and eventually to the Super Bowl, where they were able to defeat the Patriots when it mattered most. Although the project group was able to conclude that a team’s performance matters more than a team win, the data did not yield a definition or classification tool to determine whether or not a team was “hot.”
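The end-of-season records examined above (last game, last five, last eight) can be computed from a list of game results. The sketch below uses a hypothetical 16-game sequence, not any team’s actual schedule, chosen so that the last five games come out to 1-4.

```python
from typing import List, Tuple


def last_n_record(results: List[str], n: int) -> Tuple[int, int]:
    """Return (wins, losses) over the final n games.

    `results` lists each game in order as "W" or "L".
    """
    recent = results[-n:]
    wins = recent.count("W")
    return wins, len(recent) - wins


# Hypothetical 16-game regular season (illustrative only).
season = list("WWLWWWWWLWWLLLWL")

for window in (1, 5, 8):
    wins, losses = last_n_record(season, window)
    print(f"last {window}: {wins}-{losses}")
```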

Offensive Strategy

Eventual Super Bowl winners are able to outscore their opponents, which raises the question of how champion offenses accomplish this: through the passing game, the running game, or a balance of the two. Historical data shows that champions’ passing attempts per game range between 30 and 35 for 17 of the 22 data points, and rushing attempts per game have a wider range, between 25 and 33, for 18 of the 22 data points. The New England Patriots did not follow this trend when their passing attempts per game were evaluated individually in the two playoff seasons before their Super Bowl appearance, averaging 27 and 42 passing attempts per game. However, when averaged, the Patriots passed the ball 34.5 times per game, which falls within the aforementioned range. The Giants were also outliers, averaging 38.31 and 40.75 passing attempts per game in the regular and playoff seasons. The Indianapolis Colts averaged 38.5 passing attempts per game in the playoffs. The Baltimore Ravens, Colts, and Steelers fall outside the range for rushing attempts per game. The Ravens and Colts rushed the ball more during the playoffs, while the Steelers consistently rushed the ball more throughout the entire 2005 season. This can be attributed to running back Jerome Bettis, who retired after 2005, and explains why the Steelers were not outliers during their 2008 appearance. The ranges for passing and rushing attempts overlap. Because of this, the group concluded that, overall, balanced offenses able to throw and run the ball effectively are characteristic of Super Bowl winning teams. This observation was predicted, as teams that vary their offensive strategy continually challenge opposing defenses, scoring points in ways that are difficult to predict.
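The range checks in this section amount to simple comparisons. The sketch below uses only figures quoted above; the constant and function names are chosen here for illustration.

```python
from typing import Tuple

PASS_RANGE = (30.0, 35.0)  # passing attempts per game, typical for champions
RUSH_RANGE = (25.0, 33.0)  # rushing attempts per game, typical for champions


def in_range(value: float, bounds: Tuple[float, float]) -> bool:
    """True when value lies within the inclusive bounds."""
    low, high = bounds
    return low <= value <= high


# Patriots: 27 and 42 passing attempts per game in the two playoff seasons cited.
patriots_average = (27 + 42) / 2

passing_checks = {
    "Patriots (averaged)": patriots_average,
    "Giants (regular season)": 38.31,
    "Giants (playoffs)": 40.75,
    "Colts (playoffs)": 38.5,
}

for label, attempts in passing_checks.items():
    status = "within" if in_range(attempts, PASS_RANGE) else "outside"
    print(f"{label}: {attempts} passing attempts per game, {status} the range")
```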

Team Strength: Offense vs. Defense

After analyzing champion offenses, it was necessary to determine whether winning teams relied more on offense than on defense or balanced both aspects of the team. In order to do this, the group compared average points per game, 24.76, and average opponent points per game, 20.07. The overall averages do not indicate a large difference between winning and losing teams; however, when the score margin is averaged, champions score on average 6.5 points more than their opponents, roughly a touchdown more than their defense allows. The group concluded that Super Bowl winning teams have a balanced strategy based on the low scoring margin. Champions collectively have above average offenses and defenses without having to rely particularly on either.

Running Back-by-Committee

The running back-by-committee strategy is becoming increasingly popular at the collegiate and professional levels, with four of the last six champions utilizing it. “Running back-by-committee” is defined as having at least two legitimate rushing threats who can line up in the backfield. There are two advantages to this strategy. The first is that extra depth at the running back position shows as the defensive line and linebackers grow tired during the course of the game while the running backs are able to continue to play as if it were the first quarter. The second, and more important, is that the opposing team has to split its time in the film room, meaning it spends less time studying each running back individually. Oddly enough, a championship team that used only one running back during the regular season supports this reasoning. In 2010 the Packers ran the ball with Brandon Jackson during the regular season. The Packers had drafted James Starks in the previous draft; however, he was injured in training camp and did not play until the conclusion of the regular season. Then, in the first game of the playoffs, Starks set the franchise record for most yards in a postseason game by a rookie and continued to play at a high level for the duration of the playoffs. James Starks clearly has talent; however, he performed at a high level in part because teams did not have film on him, making it difficult to prepare for and thwart his running. The group concluded that a running back-by-committee strategy can be an advantage if a team has qualified personnel; however, it is by no means necessary to win the Super Bowl.

Perceived Conference Difficulty

The project group predicted that the perceived difficulty of the conference would correlate with the Super Bowl winning teams. In order to execute this analysis, the group collected data over the 11 seasons on how many wins each conference had when facing opponents from the other conference. The group predicted that the winning conference, the one that won more of these games, would indicate the eventual Super Bowl winner, but in this case the prediction was wrong. Since 2002, the Super Bowl winning team was from the conference that won more games only 54% of the time. The probability that the AFC wins the Super Bowl and is the winning conference is around 40%. In contrast, the probability that the NFC wins the Super Bowl and is the losing conference is around 37%. Based on historical data, conference difficulty only indicates the championship team roughly half of the time. The NFC is more likely than the AFC to produce a Super Bowl winner as the losing conference; for the AFC, this has about a 14% chance of happening. The probability that the NFC is both the winning conference and the Super Bowl winner is approximately 8%. With the 11 data points from 2002 to 2012, inter-conference record is not a consistent predictor of the Super Bowl winner.
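The probabilities above are relative frequencies of two facts per season: which conference won the Super Bowl, and which conference won more inter-conference games. A minimal sketch of that tabulation follows. The four season records are made-up placeholders, not the group’s data, and the function names are chosen here for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (Super Bowl winner's conference, conference with more inter-conference wins)
Season = Tuple[str, str]


def joint_shares(seasons: List[Season]) -> Dict[Season, float]:
    """Share of seasons falling into each (champion, winning conference) pair."""
    counts = Counter(seasons)
    return {pair: count / len(seasons) for pair, count in counts.items()}


def match_rate(seasons: List[Season]) -> float:
    """Share of seasons in which the champion came from the winning conference."""
    return sum(1 for champ, winner in seasons if champ == winner) / len(seasons)


# Placeholder records (illustrative only).
example_seasons: List[Season] = [
    ("AFC", "AFC"),
    ("NFC", "AFC"),
    ("NFC", "NFC"),
    ("AFC", "AFC"),
]

for pair, share in joint_shares(example_seasons).items():
    print(pair, f"{share:.0%}")
print("champion from winning conference:", f"{match_rate(example_seasons):.0%}")
```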

Emotional and Circumstantial Influence

Though data is important, the project group sought to consider the emotional aspects affecting a team and how emotion plays into the game of football. Football is a strategic and calculated sport; however, players, coaches and fans are all emotionally invested. When players have other incentives to play for, they are believed to play at their best for the duration of the game rather than noticing their bodies’ natural fatigue. Although emotion cannot be measured or quantified, the group still decided to explore this aspect of the game. Emotion has been evident in the last ten years, most recently this year with Ray Lewis and the Ravens. Ray Lewis was injured in week six and sat out the rest of the regular season (11 weeks). He not only returned for the playoffs but also announced that he would be retiring at the end of the season. The fact that this would be his last attempt at a title, combined with his pastoral pre-game speeches, had the Ravens playing hard in every game of the playoffs. Although Ray Lewis did not play at an all-star level during the championship run, his drive to win and his inspiration of his team helped the Ravens play at the level necessary to end the season as champions.

Another example of an emotional advantage is the Saints’ run to the Super Bowl in 2009 after the emotional roller coaster that came in the wake of Hurricane Katrina. Katrina devastated the city of New Orleans, killing over 1500 people in the state of Louisiana and forcing 1.2 million people along the Gulf Coast to evacuate. Almost six months after Hurricane Katrina, the Saints opted to sign a quarterback who was undesired in the league because he was, at the time, returning from shoulder surgery. That quarterback, Drew Brees, was thankful for a second chance. He moved into the city of New Orleans at a time when other players did not want to live in a city that required rebuilding. He invested himself in the city and helped with relief efforts. By the 2009 season, Drew Brees had put together several successful seasons. New Orleans was beginning to see an influx of residents and visitors once more, and life in the city was slowly moving back to its pre-Katrina status quo. The difference this time was that the citizens of New Orleans had a phenomenal football team to root for. After a first round bye, on January 16, 2010, the Superdome hosted a playoff game; four years earlier it had been used as a shelter for the citizens of New Orleans. The Saints went on to win the Super Bowl after two victories in the Superdome and one over the Colts in Miami. After the game was over, Drew Brees was quoted as saying “We felt as if there was no way we could lose this game...This one is for you New Orleans.” The project group believes times like these can cause players to play at a level they ordinarily cannot reach.

Conclusions

After utilizing ESPN’s rather large database of NFL statistics, the group was able to answer its research questions and draw conclusions. The group determined there were three things that do not factor into a team winning the Super Bowl: conference difficulty, whether or not the team had a bye, and whether or not it used a running back-by-committee system. The group determined that if a team has the proper personnel, then using multiple running backs can prove to be advantageous; however, it is by no means a requirement to win the Super Bowl.

The group determined that how a team’s quarterback plays in the playoffs is indicative of how far that team goes. Quarterbacks for eventual champions play better in the playoffs, with a higher QBR and a lower interception rate. The rest of the team also plays better in the playoffs, historically having higher third down conversion rates and defensive interception rates.

There also proved to be consistencies among champions during the regular season. Teams tend to have balanced offensive attacks (rushing/passing) along with above-average offenses and defenses, but not to the point where they have to rely on either one. There proved to be more consistencies for Super Bowl teams in specific statistical categories. The conclusions the group was able to draw are as follows: champions establish a solid running game and then capitalize on their passing opportunities. On defense, they similarly put pressure on opposing quarterbacks and disrupt the running game.

The group found that there are also circumstances that affect the Super Bowl outcome but cannot be quantified. These include how “hot” a team is and whether or not a team has an emotional advantage. It was determined that how a team is playing at the end of the season can make it “hot”; however, a win/loss record does not capture this. Teams can also have an emotional edge if they have added incentives to play for, like the Saints in the wake of Hurricane Katrina. Though there are common factors among championship teams, there is no definitive answer as to whether or not a team will win the Super Bowl.

Sources

http://visual.ly/afc-vs-nfc-records-year?view=true

http://espn.go.com/nfl/story/_/id/6909058/nfl-total-qbr-faq

http://www.weather.com/newscenter/topstories/060829katrinastats.html

Sunday, April 28, 2013

Visualization of Analytics on the Go: Incorporating Roambi into your Life

Visualization is key to conveying massive amounts of technical information quickly and effectively. A picture is worth a thousand words. What is better than seeing a picture, though? Interacting with the data. That is what Roambi is trying to provide for its users. As an interactive big data app, Roambi extracts data from a business's existing business intelligence system and then allows users to turn that data into clear, easy-to-understand visualizations. A key feature of Roambi is that it lets you create visualizations on the go, using just your tablet. A year ago, Roambi had approximately 84,000 customers, and I imagine that number has grown as tablet usage has increased dramatically over the past year.

Roambi provides several different lines of business products to its customers. Roambi Flow incorporates analytics with additional information (in text) to provide a more visually engaging experience for presentations. Roambi Analytics provides real-time information from your business intelligence system, which allows you to create mobile visualizations from data extracted from Salesforce, SAP, Oracle, IBM, and other databases.

By allowing users to manipulate their visualizations on the go, Roambi has tapped into a unique market space. In my opinion, Roambi would have great value for consultants who are traveling and consistently working with a variety of people. Visualizations of large quantities of data can provide immediate credibility, which is key in the consulting industry. Although Roambi is a paid service, access to visualization on the go can be a pivotal part of selling a process or idea.

Sources:

1. http://gigaom.com/2012/03/11/10-ways-big-data-is-changing-everything/10/

2. http://www.roambi.com/

Big Data for Big Knowledge in Supply Chain Management

Benefits of Big Data in Supply Chain

One of the areas that can benefit greatly from big data is supply chain management. As companies place more focus on improving their customers’ overall experience rather than just on the bottom line, big data can provide big insights. A recent article on supplychainbrain.com suggests that manufacturing now includes a service aspect, called “servitization.” This new area of industry focus requires more information for operations departments, as it increases the complexity of planning. This data must be available to key stakeholders in real time.

A significant aspect of this real time data is the use of shared data. As companies are expanding their ventures into big data territory, they are, with increasing frequency, sharing data across their corporation rather than just keeping it within one department. Cloud computing is allowing this trend to improve business decisions across multiple groups in the business. This allows for better end-to-end process collaboration.

A requirement for getting the most out of big data is having a flexible supply chain. If there is no flexibility in the current process, then what value can big data add to a company’s supply chain? Flexibility allows a company to make quick changes in response to its forecasts. By adding big data as a decision criterion for forecasting, companies are able to predict their demand more accurately based on customer as well as company behavior.

In light of recent tragic global events such as extreme weather or violence, companies could use their big data to anticipate changes in the supply chain. By scanning social media, weather, news, or other real-time data outlets, they might be able to be proactive about changing suppliers, rerouting shipments, or changing production quantities. The future of flexible supply chain management could hinge on the successful application and integration of big data.

Sources:

1. http://www.supplychainbrain.com/content/technology-solutions/sales-operations-planning/single-article-page/article/the-big-data-imperative-in-forecasting-demand-planning/

2. http://www.scemagazine.com/big-data-driving-changes-in-supply-chain-management/

Tutorial: how to load the 1000 genomes data into Amazon Web Services

This tutorial gives written instructions for each step, followed by a picture of that step.

Step 1.
Start by logging into AWS. Once you have done that, you will see this page. Click "EC2" (virtual servers in the cloud).

Step 2.
Click on "Launch Instance"

Step 3. The next page will say "launch with classic wizard." Just click "Continue."

Step 4. The next page will be titled "Request instances wizard." Just click the "Community AMIs" tab.

Step 5. Next to the "Viewing all images" drop-down field, type in "1000HumanGenomes."

Step 6. Once the AMIs have popped up, click "Select" next to the first one.

Step 7. After that you will be taken to the instance types selection. Click the drop down arrow and select the type of instance you would like to use. I chose "M1 Large."

Step 8. Next you will be prompted to create a password in order to access your AMI. Type your password in the text field shown in the picture.

Step 9. Next you will be prompted with a "Storage device configuration" menu. Just click continue.

Step 10. It will ask you if you want to tag your instance. You can just click continue.

Step 11.

Next you will be prompted to enter your personal key pair. Enter your keypair into the text field marked in the photo.

Step 12. Next, you will be prompted to enter your security group. Just select the default one.

Step 13. In Step 13 you will be shown all the specifics you requested in the previous steps. Click Launch if they are all satisfactory.

Step 14. After that, you will be told that your instance is being launched. Click "close."

Step 15. In your instances section, check the Status Checks column. After a while, it should say "checks passed."

Step 16. After that, you are done. If you have a piece of software called Linux bio cloud on a computer with an Ubuntu Linux operating system, you should be able to work with the data!
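
For readers who would rather script the launch than click through the console, the same steps can be roughly sketched with the boto3 library. This is only a sketch under several assumptions that are not part of the tutorial above: the region, the wildcard name filter for the AMI, and the key pair name you pass in are placeholders, the wizard's password step has no direct equivalent here, and the "M1 Large" instance type may not be offered in every region today.

```python
def build_launch_request(image_id, key_name, instance_type="m1.large"):
    """Collect the wizard choices (steps 7, 11 and 12) as run_instances arguments."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,   # step 7: "M1 Large"
        "KeyName": key_name,             # step 11: your key pair (placeholder name)
        "SecurityGroups": ["default"],   # step 12: the default security group
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch_1000_genomes_instance(key_name, region="us-east-1"):
    """Find the community AMI, launch it, and wait for the status checks."""
    import boto3  # third-party; imported lazily so the helper above works without it

    ec2 = boto3.client("ec2", region_name=region)  # region is an assumption

    # Steps 4-6: search AMIs by name and take the first match.
    # The wildcard pattern is an assumption about how the AMI is named.
    images = ec2.describe_images(
        Filters=[{"Name": "name", "Values": ["*1000HumanGenomes*"]}]
    )["Images"]
    if not images:
        raise RuntimeError("no AMI matched the name filter")
    request = build_launch_request(images[0]["ImageId"], key_name)

    # Steps 13-14: launch the instance.
    instance_id = ec2.run_instances(**request)["Instances"][0]["InstanceId"]

    # Step 15: block until the instance status checks have passed.
    ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
    return instance_id
```

Calling `launch_1000_genomes_instance("my-keypair")` (with a key pair name of your own) needs AWS credentials configured and will start a billable instance, so treat it as an outline rather than something to run blindly.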

Saturday, April 27, 2013

Counter-terrorism using Data Mining

The Boston Marathon bombings terrorized the country. Shortly afterward, the FBI turned to data mining to narrow down the suspects. The FBI team analyzed 10 TB of data, including cell phone tower call logs, text messages, social media data, photos and videos from surveillance cameras, and additional photos and videos from members of the public who were present at the marathon. Twitter data was also analyzed with the help of a company called Topsy Labs, which maintains a repository of tweets dating back to 2010 along with the locations from which they were sent. The analysis covered not only the few days before the bombings but also billions of tweets related to Boston and its suburbs. This enormous volume of data was analyzed using the FBI's software and common tools such as face recognition and position triangulation. Even though mining this data did not lead to the capture of the suspect Dzhokhar Tsarnaev, it shows that data mining can be an effective tool for counter-terrorism. In the future, developing models that draw on artificial intelligence could help reduce terrorism. Just as the precogs predict crimes in the movie "Minority Report," supercomputers could be made to analyze data from satellite images, drone video feeds, and photos and videos uploaded by users to YouTube, Facebook, Twitter and other social media to predict a crime.
Predictive analysis seems to be the future of counter-terrorism.

Reference: http://fcw.com/Articles/2013/04/26/big-data-boston-bomb-probe.aspx?Page=1

Google search and the stock market - Google Trends Strategy

By mining Google search terms over a span of eight years, researchers at Warwick Business School, University College London and Boston University say that early signs of stock market moves can be detected and used to decide when to buy or sell stocks. They analyzed how often users searched Google for financial terms and developed an investment strategy from it. Search volumes were analyzed on a weekly basis, and shares were bought or sold accordingly. If the search volume for a term went up in a given week, they would open a hypothetical short position and close it the following week. If the search volume went down, they would buy and then sell the following week. This strategy yielded a return of 326 percent, which is almost twenty times that of the conventional strategy. With this strategy, stock market trends might be anticipated, and it could offer a whole new experience for potential investors and for people who like playing with numbers.
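
The trading rule described above can be sketched in a few lines of Python. This is a simplified illustration and not the researchers' actual procedure: it assumes the signal is a plain week-over-week comparison of search volume, it ignores transaction costs, and the function name and input layout are hypothetical.

```python
def trends_strategy_wealth(search_volume, prices):
    """Simulate the search-volume rule and return final wealth (starting from 1.0).

    search_volume[i] is the search volume during week i.
    prices[i] is the price at the start of week i.
    At the start of week i we compare the two previous weeks of search volume,
    open a position, and close it at the start of week i + 1.
    """
    wealth = 1.0
    last_week = min(len(search_volume) + 1, len(prices) - 1)
    for i in range(2, last_week):
        entry, exit_ = prices[i], prices[i + 1]
        if search_volume[i - 1] > search_volume[i - 2]:
            # Search interest rose: short now, buy back next week.
            wealth *= 1.0 + (entry - exit_) / entry
        elif search_volume[i - 1] < search_volume[i - 2]:
            # Search interest fell: buy now, sell next week.
            wealth *= exit_ / entry
        # Unchanged search volume: stay out of the market this week.
    return wealth
```

With real inputs, `search_volume` would come from weekly Google Trends figures for one financial term and `prices` from weekly index prices over the same period; the return is the final wealth minus one.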

Reference: http://www.businessweek.com/articles/2013-04-25/big-data-researchers-turn-to-google-to-beat-the-markets

A Facebook user profile through Big Data

Research by Wolfram|Alpha, a computational knowledge engine, shows how people meet and how their lives work by analyzing friend and relationship status on Facebook. More than a million Facebook users volunteered their data for the research on Wolfram's site. Wolfram Research analyzed each user's activity and used it to generate reports on the activity of users in the United States.
Reports can be generated for each Facebook user, and these reports are amazing. A word cloud, the relationship status of friends, the distribution of friends' ages, the friend network and many other fascinating reports can be generated. Friends are grouped into clusters and classified as social insiders (a friend who shares a large number of friends with you), social outsiders (a friend who shares at most one friend with you), top social connectors (a friend who connects groups of friends that are otherwise disconnected), top social neighbors (a friend with a small number of out-of-network friends, that is, friends of theirs that you don't know) and top social gateways (a friend with a large number of out-of-network friends). Basically, these terms are coined using graph theory.
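
To make the graph-theory idea concrete, here is a minimal Python sketch that labels friends as social insiders or outsiders by counting shared friends. It is an illustration only, not Wolfram's method: the "at most one shared friend" rule comes from the definition above, while the insider threshold of three, the function names and the input format are assumptions.

```python
def shared_friend_counts(friendships):
    """Count, for each friend, how many of my other friends they are friends with.

    friendships is an iterable of (friend_a, friend_b) pairs among my friends.
    """
    # Deduplicate pairs so ("ann", "bob") and ("bob", "ann") count once.
    unique_pairs = {frozenset(pair) for pair in friendships}
    counts = {}
    for pair in unique_pairs:
        if len(pair) != 2:
            continue  # ignore self-loops
        for friend in pair:
            counts[friend] = counts.get(friend, 0) + 1
    return counts


def classify_friends(my_friends, friendships, insider_threshold=3):
    """Label each friend using the shared-friend count (threshold is hypothetical)."""
    counts = shared_friend_counts(friendships)
    labels = {}
    for friend in my_friends:
        shared = counts.get(friend, 0)
        if shared <= 1:
            labels[friend] = "social outsider"
        elif shared >= insider_threshold:
            labels[friend] = "social insider"
        else:
            labels[friend] = "in between"
    return labels
```

Connectors, neighbors and gateways would need more information, such as which clusters a friend bridges and how many friends they have outside your network, so they are left out of this sketch.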
These are some of the screenshots from my (Robin Muthukumar) report

My activity in Facebook

Friends network

Color coded friends network

Each user can get his or her own report by using this link: http://www.wolframalpha.com/facebook/
Data like these were analyzed and compared to United States census data, and the two were found to agree closely. This kind of research could help the government gauge people's mindset and pass bills or amend laws accordingly. It could also help politicians court votes.

Reference: http://bits.blogs.nytimes.com/2013/04/25/looking-at-facebooks-friend-and-relationship-status-through-big-data/

Advertisements and their impact on Facebook users

After initial skepticism, Yahoo! embraced web advertising as a way to make money from the web. Google then used a more refined method of web advertising to make even more money from it.
A study of web advertising on Facebook by a team at Facebook shows that the mere presence of an ad on Facebook has an influence on users. Two data sets were compared. The first was the number of users clicking an ad on Facebook. The second was purchasing-pattern data from the analyst firm Datalogix. On comparison, the people at Facebook and Datalogix observed that the presence of an ad makes users buy the product even if they don’t click on it. According to Rick Robinson, a freelance writer, "Big Data analytics show that mere exposure to Facebook ads does indeed influences users' purchasing patterns." Facebook suggests ads based on the likes and interests of the user. This suggests that web advertising can have the upper hand over conventional advertising methods in driving sales of a product.

Reference: http://midsizeinsider.com/en-us/article/big-data-analytics-takes-web-advertising

Tutorial: Motion Chart on African Nations

I have looked further into the visualization of GDP per capita versus percent of GDP spent on the military that I created earlier in the semester. This is a quick tutorial on how I altered the Google spreadsheet to focus on African nations and on what the chart reveals about that continent.

The first step was quite simple but time-consuming. I had to go through the data and delete 95 nations, leaving the 35 African countries that had sufficient data.

Next I separated the 35 nations into the five UN geographical sub-regions: Northern, Western, Central, Southern, and Eastern.

After grouping the nations under these regions, I gave each region a numerical value so I could distinguish them by color on the motion chart.

Northern (1), Western (2), Central (3), Southern (4), Eastern (5)
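
If you would rather tag the rows with a script than by hand, the region coding can be sketched in Python as follows. This is a minimal sketch under assumptions: I did the coding manually in the spreadsheet, so the column names "Country" and "Region" and the `country_regions` lookup are hypothetical, and you would need to fill in the lookup for all 35 nations yourself.

```python
# Numeric codes for the five sub-regions, matching the list above.
REGION_CODES = {"Northern": 1, "Western": 2, "Central": 3, "Southern": 4, "Eastern": 5}


def add_region_codes(rows, country_regions):
    """Keep only rows for listed countries and attach the numeric region code.

    rows is a list of dicts, one per spreadsheet row, with a "Country" key.
    country_regions maps a country name to one of the five region names.
    """
    coded = []
    for row in rows:
        region = country_regions.get(row["Country"])
        if region is None:
            continue  # nation not in the chosen set, so drop the row
        coded.append({**row, "Region": REGION_CODES[region]})
    return coded
```

The resulting rows can be written back out with the standard `csv` module and pasted into the spreadsheet, where the "Region" column then drives the chart's color setting.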

When finished, simply select “Insert”, then “Chart”, “Charts”, “Trend”, and finally the image of a motion chart to the right of Trend. Select “Insert” and the chart will be inserted onto the tab containing your data. If you select the drop-down box at the upper right-hand corner of the chart, you can then select “Move to own sheet,” so you don’t have to move the chart around to look at your data.

Once the chart is created, select the “Color” drop-down box and select “Region.” This will color-code the nations based on the regions in which they are located. Then select the “Size” drop-down box and select “Population.” This will size each nation's indicator according to its population.

Now just press the play button in the bottom left corner of the chart and watch the motion chart at work.

Data from WorldBank

Thursday, April 25, 2013

Visualization Project: Blog Plagiarism

Visualization Project: Plagiarism (Google API)

Team: Greg Adams, Alex Lee, Andrew Smith

Now, nobody go and repost this and claim it as their own ;)

Titanic Competition Using BigML

We used BigML to compete in the Titanic DM competition on Kaggle. Team members: Andrew J. Smith, Greg Adams, Alex Lee, Chelsea McMeen

Digit Recognizer

Team members: Greg Adams, Drew Smith, Alex Lee, Chelsea McMeen

We wrote a MATLAB program for the Digit Recognizer competition. This is a video that explains how we tackled the competition.

Product review mining

I have been reading some papers and posts about product review mining lately, and I want to share some of the ideas I have summarized.

First, this is a kind of text mining. Review mining is also called opinion mining. Researchers want to find out what reviewers' opinions of a product are, either negative or positive.

Review mining does not focus on ratings, such as product ratings on Amazon. It focuses on the text written by customers or professional reviewers. Certainly, the mining results are helpful for producers looking to improve their products. If the reviews are already split into "cons" and "pros," as in Newegg.com's reviews, mining is much easier.

The goal is to find feature words and then calculate measurement scores based on the number of feature words.

Feature words can represent customer opinion directly. For example, if customers say "awesome" or "excellent," these words show positive opinions, but if they say "bad" or "s***," negative opinions are shown. Some feature words also depend on the product. For instance, most customers want a quiet computer case, so "quiet" and "no sound" are good for computer cases, but for stereos these words are negative.

Finding the feature words is regular text mining. Once the feature words have been found, evaluation methods decide whether a comment is negative or positive. The snippet below is from ref. 3; the SO value is the criterion for whether a comment is positive or negative.
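
Independently of the SO value from ref. 3, the basic feature-word counting idea described above can be sketched in Python. This is a toy illustration and not the method from any of the referenced papers: the word lists, the product names and the simple "positive count minus negative count" score are all assumptions chosen to mirror the examples in this post.

```python
import re

# Feature words that carry the same opinion for any product (illustrative lists).
GENERAL_POSITIVE = {"awesome", "excellent"}
GENERAL_NEGATIVE = {"bad"}

# Feature words whose meaning depends on the product being reviewed.
PRODUCT_LEXICONS = {
    "computer case": {"positive": {"quiet"}, "negative": set()},
    "stereo": {"positive": set(), "negative": {"quiet"}},
}


def review_score(text, product):
    """Return (number of positive feature words) - (number of negative ones)."""
    words = re.findall(r"[a-z']+", text.lower())
    lexicon = PRODUCT_LEXICONS.get(product, {"positive": set(), "negative": set()})
    positive = GENERAL_POSITIVE | lexicon["positive"]
    negative = GENERAL_NEGATIVE | lexicon["negative"]
    positives_found = sum(word in positive for word in words)
    negatives_found = sum(word in negative for word in words)
    return positives_found - negatives_found


def classify_review(text, product):
    """Label a review from the sign of its score."""
    score = review_score(text, product)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The same word "quiet" pushes a computer case review up and a stereo review down, which is the product-dependence point made above. A real system would also need to handle phrases such as "no sound" and negation such as "not bad," which this single-word sketch ignores.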

Ref:
1. http://www.slideshare.net/felipemattosinho/mining-product-opinions-and-reviews-on-the-web
2. Movie Review Mining and Summarization, DOI: 10.1145/1183614.1183625
3. Movie Review Mining: a Comparison between Supervised and Unsupervised Classification Approaches, DOI: 10.1109/HICSS.2005.445

Wednesday, April 24, 2013

Privacy in the Big Data era

We already have mountains of information in a variety of forms, such as plain text in social media, spreadsheet data about patients, and massive publicly available databases. When this kind of data is used, de-identification has been crucial in order to prevent individuals from becoming victims of identity theft or other types of crime. However, as data processing power drastically improves, re-identification by analyzing patterns in individuals' behavior is no longer impossible. It seems natural that many people are concerned about the dangers of developing big data technology.

Here is a paper that presents the authors' thoughts on privacy in the Big Data era.

Big Data: Big Benefits

Google Flu Trends is a good example of the benefits of Big Data. It provides a service that predicts and locates outbreaks of the flu by making use of aggregate search engine queries. Early detection of disease, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza.

Traffic management and control is a field witnessing significant data-driven environmental innovation. With electronic toll pricing systems, drivers pay depending on their use of vehicles and roads. This kind of management and control also enables governments to potentially cut congestion and the emission of pollutants.

Big Data: Big Concerns

However, the harvesting of large data sets and the use of analytics implicate privacy concerns. Ensuring data security and protecting privacy become harder as information is multiplied and shared ever more widely around the world. If de-identification becomes a key component of business models, most notably in the contexts of health data, online behavioral advertising, and cloud computing, governments and businesses could be in more trouble.

What data is "Personal?"

It seems that there is no common view even among law scholars. To quote Betsy Masiello and Alma Whitten:
"anonymized information will always carry some risk of re-identification. Many of the most pressing privacy risks exist only if there is certainty in re-identification, that is if the information can be authenticated. As uncertainty is introduced into the re-identification equation, we cannot know that the information truly corresponds to a particular individual; it becomes more anonymous as larger amounts of uncertainty are introduced."

The authors did not present a tangible conclusion. Of course, this debate will continue. I think the one obvious thing in this debate is that attempts to harvest private data will persist, and countermeasures against those attempts will be deployed as well.

Reference: http://www.stanfordlawreview.org/online/privacy-paradox/big-data

Tuesday, April 23, 2013

Selection Bias in the NHL Draft

It has been a long-standing practice in the National Hockey League to value slightly older players in the NHL draft. The Relative Age Effect occurs when people who are relatively older than the rest of their peers in their age group are more likely to succeed. This phenomenon has been observed to occur reliably in certain educational and athletic settings. A group of psychology professors has discovered that NHL teams have been biased toward slightly older players in the draft. However, the research has shown that players born in the second half of the year (relatively younger) are more likely to succeed in the NHL. The study looked at twenty-seven years of NHL data and found that relatively younger players have much longer careers. Players born between July and December accounted for 34% of the players drafted, but they played in 42% of the games and scored 44% of all the points. On the other hand, players born from January to March accounted for 36% of the players drafted, but only 25% of the points and 28% of the games. This discovery seems very odd to me. It doesn't seem like the part of the year you were born in would have a substantial effect on your career in the NHL. Also, this finding is in contrast with most other studies of Relative Age Effects, which state that relatively older individuals are more likely to succeed. Another study showed that the largest share of the top prospects (40%) in the Canadian youth hockey leagues were born in the first three months of the year, while only 15% were born in the later part of the year. The authors say they are not sure why this phenomenon has been occurring, just that it is an interesting finding that merits further research. I found this study very interesting; who knew that the part of the year you were born in would affect how well you performed in a sport?
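
One way to read these percentages is to divide each group's share of games or points by its share of draft picks. A ratio above 1 means the group produced more than its draft share would suggest. The short Python sketch below does this arithmetic with the figures quoted above; the ratio itself is my own illustration and not a statistic reported by the study.

```python
def representation_ratio(share_of_output, share_of_draft):
    """Share of games or points divided by share of draft picks."""
    return share_of_output / share_of_draft


# Shares (in percent) quoted above.
groups = {
    "July-December": {"drafted": 34, "games": 42, "points": 44},
    "January-March": {"drafted": 36, "games": 28, "points": 25},
}

for name, shares in groups.items():
    games_ratio = representation_ratio(shares["games"], shares["drafted"])
    points_ratio = representation_ratio(shares["points"], shares["drafted"])
    print(f"{name}: games {games_ratio:.2f}, points {points_ratio:.2f}")
```

The July to December group comes out at roughly 1.24 for games and 1.29 for points, while the January to March group comes out at roughly 0.78 and 0.69, which is the over- and under-performance relative to draft position that the study describes.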

Sources:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0057753#abstract0

http://www.wired.com/playbook/2013/03/nhl-selection-bias/

Recommendation Algorithms

We live in the age of recommendation algorithms. With such a diverse and seemingly infinite market, there is plenty of room for a consultant that can tell you what you want. One good example of a company that already uses a well-designed algorithm is Netflix. It effectively narrows a catalog of thousands and thousands of movies down to a list of ten that you will most likely enjoy watching. The recommendations it comes up with are actually quite scary, and by scary I mean correct. But why a list of ten? Wouldn't a perfect recommendation be just one thing, the thing you wanted to watch? One company, Stitch Fix, has done exactly this in the domain of apparel. Eric Colson of Stitch Fix explains the company's business model in this video:

Weeding out the noise

While studying big data, one might misinterpret how data mining works. You must first understand that information does not equal insight. While insight always entails information, information does not always entail insight. Dr. Michael Wu explains three criteria that information must meet to provide valuable insights.

1. Interpretability. Because big data can be so unstructured and diverse, a large amount of it can be uninterpretable.

For example, consider this sequence of numbers: 123, 243, 187, 89, and 156. This data could mean a number of things (street addresses, the total minutes it takes to write a blog post, the number of candies in a bag). The point Dr. Wu is making with this criterion is that without metadata to describe the data further, you are unable to interpret it and therefore cannot gain any insight from it.

2. Relevance. Information must be relevant in order for it to be of any use. Relevant info is sometimes referred to as a signal whereas irrelevant information is referred to as noise. But relevance is a very relative term. "Information that is relevant to me may be completely irrelevant to you, and vice versa. Relevance is not only subjective, it is also contextual. If I’m visiting NYC next week, then NYC traffic will suddenly become very relevant to me. But after I return to Alabama, the same information will instantly become irrelevant again."

3. Novelty. Information must be novel, meaning that it is new and does not tell you something you already know.

Clearly this criterion is also very relative. Something I consider old news you might find new, and something I find insightful you might not.

source: http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/

Big Data in Logistics

source: http://www.oracle.com/us/corporate/profit/opinion/021512-sswaminathan-1523937.html

Big data has shown that it can change an industry, and it promises to do the same for third-party logistics. So what does logistics mean? Logistics can mean a lot of things. Mainly, it is the part of supply chain management that plans, implements and controls the efficient transportation of goods. It also covers the storage of items, goods and services between sellers and customers. Big data gives shippers, 3PLs (third-party logistics providers), and carriers a whole new advantage in their market. Companies that use big data in the correct ways will get increased visibility into future opportunities.

Below are some ideas and applications of Big Data analytics in Logistics:

Source	Opportunity
Weblogs	Patterns that customers show when shopping at certain times of the year
Trailer tags	Insight into the times at which trucks arrive/leave and the reasons for delays
Pallet/Case/SKU tags	Insight into the times at which packages arrive/leave and the reasons for delays
Electronic on-board recorder	Insights into travel times, load/unload times, and driver hours
Mobile devices	Insights into mobile application usage by customers, partners, and employees
Social platforms	Customer insight: who “likes” your products, who has advocated your products, who has issues, and what their issues are

Machine Learning and Online Fraud

We all know that data mining has many extremely useful applications, and this blog discusses a variety of them. In looking to expand my knowledge of the subject, I always look for data mining topics different from the ones we discuss in class, one being the use of machine learning techniques to combat online fraud. The article states that most algorithms designed to detect fraud follow anywhere from 175 to 225 questions or rules. Like the rest of the world, those committing fraud are constantly changing and evolving, which is not good news for those trying to prevent it. Ex-Google employees consequently sought to develop a new approach that would detect fraud before it occurs. They have developed the Sift system, which is applied to live sites and draws millions of connections among fraudulent behaviors. New insights are already emerging as a result of this new tool. Such insights include, but are not limited to, the statistic that Yahoo users are five times more likely to create a fake email account than those who use Gmail.

More effective data mining as a result of machine learning will soon, if it does not already, outperform existing agencies looking to detect fraudulent practices. Though traditional techniques have worked in the past, the constant barrage of information uploaded to the web will soon allow many criminals to fall through the cracks. Teaching a machine to essentially question online users based on their individual activities will revolutionize the detection process and hopefully deter hackers from trying to manipulate the internet, decreasing online fraud altogether. This will be especially useful to government agencies as well. It only makes sense that hackers continually change and adapt in order to remain anonymous. Previous systems designed to protect the public from fraud are adapting at a pace much slower than hackers. Consequently, fraud is not going anywhere. The Sift system is a huge breakthrough in machine learning because it uses the predictive capabilities of the concept in a way that could save the United States alone hundreds of millions of dollars a year, as well as saving banks and the general public money.

Link to article:
http://gcn.com/Articles/2013/03/26/Sift-Science-machine-learning-anti-fraud.aspx?Page=1

The Future of Data Mining - "Fast Data"

Firstly, here are some statistics from the article I read for this particular blog post:

Every minute:

48 hours of video are uploaded to YouTube
204 million e-mails are sent
600 new websites pop up
600,000 pieces of content are shared on Facebook
Upwards of 100,000 tweets are sent

This article stresses the idea that in data mining, time matters. Author Alissa Lorentz states that we must be able to mine data as quickly as we produce it. Because of the plethora of electronic information available today, the speed of data mining is extremely important, and it is an issue of which I was previously not aware. Lorentz discusses the difference between smart data, which provides insight into large data sets, and big data, a term we apply to extremely large data sets. She then elaborates on a concept she calls "fast data." Fast data analyzes data sets in real time and will eventually be extremely useful. If one were able to analyze all of the data available on a specific company in any given day in a meaningful way, let's just say I'd be looking at the stock market.

In class, we have mainly discussed archiving data, that is, organizing data in a historical sense. This article discusses a different concept: streaming data, i.e. processing data live rather than storing it for future use. To me, this is ideal. Rather than storing every message on Facebook, providing users with a list of a certain number of friends who have recently been in contact on the social network would save memory and computing power, and would be more useful to a user whose inbox holds conversations from years ago. Applying this concept to other situations, Lorentz talks about how streaming data would provide important information on traffic or public health issues such as flu outbreaks. With the abundance of information that is constantly being added to the web, storing and archiving all of it may well become impractical. After reading this article, I think the best direction for the data mining world would be to chase the data rather than store it, instead of focusing on analyzing past data. Updating data sets in real time would not only reduce the need for large storage systems, but would also better indicate the trends occurring in the here and now.
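
To illustrate what "chasing the data rather than storing it" can look like in code, here is a small Python sketch that keeps a rolling average over a stream while holding only the most recent values in memory. It is my own illustration and does not come from Lorentz's article; the function name and the choice of a fixed-size window are assumptions.

```python
from collections import deque


def rolling_mean(stream, window_size):
    """Yield the mean of the most recent `window_size` values as each one arrives.

    Only the current window is kept in memory, no matter how long the stream is.
    """
    window = deque(maxlen=window_size)
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]  # the oldest value is about to be dropped
        window.append(value)
        total += value
        yield total / len(window)
```

Because `stream` can be any iterable, including a generator reading from a live feed, the same function works whether the data is five numbers in a list or an endless sequence of measurements such as tweets per minute.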

Link to article:
http://www.wired.com/insights/2013/04/big-data-fast-data-smart-data/