Showing posts with label Big Data. Show all posts

Monday, April 29, 2013

Big Data Analytics of Eventual Super Bowl Winning teams

Jessica Clemmensen and I, Wilson Lyle, did our final project on big data analytics of eventual Super Bowl champions.

Here is the link to the video tutorial we recorded: http://www.youtube.com/watch?v=jX9K82aE4so

Below is the paper.  If anyone is interested in reading it as a PDF as opposed to a blog post, contact Jessica or me and we would be more than happy to send it to you.  Enjoy. 

Wilson Lyle: WWL0002@auburn.edu
Jessica Clemmensen: JLC0030@auburn.edu

 

Conference Paper


Big Data Analytics


Football Statistics Super Bowl Winning Teams



 

This paper was written as a summary for a project that was completed for INSY 4970 at Auburn University.

Unless otherwise specifically stated, the information contained herein is made available to the public by Jessica Clemmensen, Wilson Lyle and Auburn University for reference or non-profit use.  The intent of this paper was to further understand the sport of football and how success can be attained.

Neither Jessica Clemmensen, Wilson Lyle, Auburn University nor any other agency or entities thereof, assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, product or process disclosed in this paper.

Neither Jessica Clemmensen, Wilson Lyle, Auburn University nor any other agency or entities thereof, assumes any liability or responsibility for money or any other assets lost due to gambling based on the conclusions stated within the paper; this is a purely academic paper and should not be used for profit.  Any other use is prohibited.

Reference herein to any specific commercial product, process, service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement, recommendation, or favoring by Jessica Clemmensen, Wilson Lyle, Auburn University or any entities thereof.

The views and opinions expressed in this paper do not necessarily state or reflect those of Auburn University or the Auburn University Industrial and Systems Engineering department.


 

Executive Summary


At the beginning of the semester in INSY 4970, the project group decided to undertake a big data, data mining project to develop a model that would accurately predict Super Bowl winners. To complete the project, the group downloaded statistics from ESPN’s website, which lists regular season and playoff statistics for all teams beginning in 2002.

 

To narrow the scope of the project, a set of hypotheses and research questions was used to explore data within the regular and playoff seasons. While these questions focused on the statistical data from ESPN, miscellaneous circumstances were also considered, such as emotional effects on player performance and circumstances surrounding the Super Bowl. The Microsoft Excel, Minitab, and Orange Canvas software packages were used during data analysis.

 

The project group analyzed a variety of statistics, looking for trends within the data that were unique to Super Bowl winning teams. For the playoff season, the team analyzed the effects of first round byes, compared playoff performance to regular season performance, and attempted to define the concept of an “elite” quarterback. Regular season analysis included identifying characteristics correlated with eventual winners, defining the term “hot” when referring to teams, examining offensive and defensive strategies, and establishing conference difficulty.

 

After data analysis was completed, the project team determined that no single model would accurately predict the outcome of the Super Bowl for all teams over the last 11 seasons. However, some characteristics were common to all Super Bowl winning teams. Champions tend to run a balanced offense, mixing the pass and the run. They also tend to balance offense and defense, with both units above average. Quarterbacks tend to improve between the regular and playoff seasons. Conference difficulty, historically, does not indicate which team will win the Super Bowl. Though emotional influence cannot be quantified, the project group believes it was a motivating factor for previous championship teams.

 


 

Introduction


The Super Bowl can be called the television event of the year, garnering a staggering 109 million viewers in 2013 alone. Viewers tune in to watch their favorite team try to capture the Lombardi Trophy, placing bets and yelling at the television. Americans across the country end their Sunday evening at least once a year happy or miserable, depending on whether or not the final score favored their wishes.

 

The score may indicate the outcome of the game, but what factors contribute to a team’s win? This question has many possible answers, and it requires an in-depth analysis of statistics for the winning teams of past Super Bowl games. If one could accurately predict the winner of the Super Bowl based on statistics, one could earn not only bragging rights but also financial gain. In 2012, Super Bowl viewers wagered an estimated $93,899,840 on various bets. Predicting the winner of the Super Bowl with confidence could finance a college education, buy a new car or a trip abroad, and with so many viewing and actively invested in the game, there is a plethora of data to analyze.

 

The question the project team sought to answer is: are there certain patterns or trends that eventual Super Bowl winning teams show? In order to provide a solution to this question, it is necessary to develop an initial starting point. As a team, it was decided that the best approach would be to group questions by regular season and playoff season performance. The questions are listed below.

 

During the playoffs:

1. How many eventual winners had first round byes?

2. Does the way the teams perform during the regular season line up with how they played during the playoffs or do many teams just go on an unprecedented Championship run?

3. Most experts would say that you must have an “Elite QB” to win a Super Bowl (10 of last 12).  What makes a QB Elite?  Is this a catch-22? Many QBs aren’t actually considered elite until after they win a Super Bowl.  This raises the question of how you determine if a QB is on his way to becoming elite.

 

During the regular season:

4. Are there specific stat categories that eventual champions excel at (PPG, OPPG, YPG, 4th down conversion percentage etc.)?

5. Many teams enter the playoffs “Hot” (Giants 2012, Packers 2011).  Do most eventual champions end the regular season “Hot”?  How do you consider a team “Hot”? The last 8 games? 4 games?

6. Do most eventual champions run similar offenses? Pass/Run/Balanced?

7. Do most eventual champions rely on their Defense? Offense? Neither?

8. Running back-by-committee has become increasingly popular in the last 10 years.  Do most eventual champions rely on a single back or multiple?

9. Does the eventual champion come from the “easier” conference?

 

The three italicized questions are believed to provide the most insight into the question at hand. The group also considered miscellaneous circumstances surrounding the Super Bowl, such as Ray Lewis’ impending retirement this year and the effect of Hurricane Katrina in the 2010 matchup.

 

Method


The group retrieved all data except AFC and NFC records from ESPN’s website (http://espn.go.com/nfl/statistcs).  AFC and NFC records were downloaded from a visualization website (http://visual.ly/afc-vs-nfc-records-year?view=true). Most of the data was compiled using Visual Basic macros; however, some data had to be entered manually in Excel. The manually entered data included byes, running back-by-committee usage, and data associated with miscellaneous circumstances. Whether entered manually or by macro, all data came from ESPN.com except for the aforementioned conference records.

The Visual Basic macros created workbooks that housed statistical data for seven categories: defense, downs, passing, receiving, returning, rushing, and total. Each category’s data was stored in three workbooks. The first held the regular season statistics for every team. The second held the playoff statistics for all teams that qualified. The third held the opposing team statistics for each team. Unfortunately, the group could not find a source for opponents’ playoff statistics, so the opposing team statistics were limited to the regular season, and the group consequently did not analyze them. All of the workbooks used essentially the same macro, which consisted of a single “for” loop. On each iteration of the loop, the macro created a new worksheet named for the year, inserted the data from ESPN.com, and then deleted the unnecessary rows before and after the data. These rows contained links to other parts of the ESPN website that did not pertain to the project. The only difference between workbooks was the URL, because a different statistic was downloaded each time.
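The group's macros were written in Visual Basic and are not reproduced in this post. As a rough illustration only, the per-year loop and row trimming described above might look like the following Python sketch. The names `trim_to_table` and `build_workbook`, the `fetch_rows` callback, and the "RK" header marker are all assumptions made for the example, not details taken from the original macros.

```python
def trim_to_table(rows, header_first_cell="RK"):
    """Keep only the statistics table from a raw page dump.

    rows: list of rows, each a list of cell strings.
    header_first_cell: first cell of the table's header row
        (an assumed marker, not taken from the original macros).
    """
    # Everything before the header row is navigation or link text.
    start = None
    for i, row in enumerate(rows):
        if row and row[0] == header_first_cell:
            start = i
            break
    if start is None:
        raise ValueError("header row not found")

    header = rows[start]
    table = [header]
    # Keep rows until the first one that no longer matches the header width;
    # anything after that is treated as trailing page residue.
    for row in rows[start + 1:]:
        if len(row) != len(header):
            break
        table.append(row)
    return table


def build_workbook(fetch_rows, years):
    """Return {year label: trimmed table}, one 'sheet' per season.

    fetch_rows: hypothetical callable that returns the raw rows for a year.
    """
    workbook = {}
    for year in years:
        workbook[str(year)] = trim_to_table(fetch_rows(year))
    return workbook
```

Passing the download step in as `fetch_rows` keeps the sketch independent of any particular URL or page layout, which is the only part that differed between the group's workbooks.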

 

The next step was to compile all of the data into a single spreadsheet. This was accomplished through a two-step process. The first step involved creating a macro that assembled all regular and post-season statistics of Super Bowl winning teams into a single workbook, with each worksheet representing a year. The second step did not require a macro: the group manually copied and pasted the 11 worksheets into a single worksheet in another workbook.
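The manual copy-and-paste step could also be scripted. The following is a minimal Python sketch, assuming each season's sheet is a list of rows whose first row is the header; the function name `combine_sheets` and the added "Season" column are illustrative choices, not part of the group's workflow.

```python
def combine_sheets(workbook):
    """Stack per-season sheets into one table with a leading 'Season' column.

    workbook: dict mapping a season label to a list of rows,
        where the first row of each sheet is the header.
    """
    combined = []
    for season in sorted(workbook):
        header, *data = workbook[season]
        if not combined:
            # First sheet: emit the shared header once.
            combined.append(["Season"] + header)
        elif combined[0][1:] != header:
            # Refuse to stack sheets whose columns do not line up.
            raise ValueError(f"sheet {season} has a different header")
        combined.extend([season] + row for row in data)
    return combined
```

Tagging every row with its season keeps the year information that would otherwise be lost when the worksheets are merged.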

 

The group analyzed the data in a variety of ways. A number of conclusions could be made by simply looking at the data once it had been arranged in a single workbook. Other data needed formatting in Excel; however, the majority of the project group’s conclusions were drawn through data analysis in the statistical software packages Minitab and Orange Canvas. Determining whether a bye affected a team’s chance of winning the championship was as simple as looking at the spreadsheet and seeing that only five of the last 11 champions had first round byes.  The running back-by-committee and miscellaneous analyses were similar to the bye analysis.  Other categories, including offensive strategy, team strength, and conference difficulty, required only Excel to calculate averages, medians, probabilities, and standard deviations. Two categories, the “meat” of the project, were analyzed using statistical software. Minitab was used to compare regular season and playoff performances through a series of t-tests. When analyzing championship characteristics, the group used Orange Canvas to create a scatter plot for each statistic downloaded. Each plot showed all teams, with championship teams colored blue and all other teams red so that Super Bowl winning teams were prominent. Trends for these statistics were then noted and analyzed.

 

 

Results & Conclusions


Byes

The project group looked to answer the nine questions outlined in the Introduction. The first question looked for a correlation between winning the Super Bowl and having a first round bye. A first round bye is awarded to a team that wins its division and finishes first or second in its conference. The team does not play during the first week of the playoffs, and it is guaranteed to play its first playoff game in its home stadium. Initially, the project group hypothesized that the majority of eventual Super Bowl winners would have first round byes. It was believed that a team that played well enough in the regular season to earn a first round bye would be qualified to win the Super Bowl as well.  It was also believed that a home game to open the playoffs would help a team win, and that beginning the playoffs with a win would help a team gather momentum and gain an emotional edge. The group’s predictions proved to be inaccurate. Only five of the 11 Super Bowl winning teams evaluated from 2002 to 2012 had a first round bye. The Saints, Steelers, Patriots (2), and Buccaneers had first round byes while the Colts, Giants (2), Steelers, Packers, and Ravens did not. The Steelers had a first round bye in 2008 but not in 2005, which indicates that a first round bye was not necessary for the Steelers to win the Super Bowl. The Patriots had first round byes for all four of their Super Bowl appearances in the last ten years; however, they won only two of the four games. Consequently, it cannot be determined that the first round byes were the reason for the Patriots’ successful Super Bowl appearances. Overall, championship teams had first round byes less than half of the time over the last ten years.
Because more champions lacked first round byes than had them, the project group concluded that first round byes do not appear to affect the outcome of the Super Bowl.

 

 

Playoff vs. Regular Season Performance

The project group also looked to evaluate whether or not a team’s regular season performance was indicative of the playoff season performance, noting that it was possible for a team to achieve unprecedented success in the playoff season due to a championship drive not executed in the regular season. The group analyzed regular and playoff season performances by subgrouping data into two categories: offense and defense. Offense performance was broken down into passing, rushing, and down conversions. Defense performance was evaluated through tracking the number of tackles and interceptions.

 

Quarterback Rating

Because the quarterback is the heartbeat of the passing game, the group hypothesized that quarterback rating would increase after the regular season. The quarterback rating is a formula established by the NFL that essentially looks to assign a value to a team’s quarterback. The formula combines terms based on completions per attempt, touchdowns per attempt, interceptions per attempt, and yards per attempt. Dean Oliver of ESPN writes that the QBR was established in order to evaluate a quarterback’s expected number of points and probability of winning. The average quarterback rating for Super Bowl winning teams was 90.2 during the regular season and increased to 98.5 in the playoff season. A t-test was performed to determine whether the average quarterback ratings in the regular and playoff seasons were the same. The test produced a p-value of 0.15, which is not statistically significant at the conventional 0.05 level; the group nevertheless took it as an indication that, on average, quarterback ratings for eventual Super Bowl winning teams do not remain the same between the regular and playoff seasons. Based on historical data, quarterbacks for champions outperform their regular season appearances.
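For readers unfamiliar with the rating, the sketch below implements the standard NFL passer rating formula, which uses the same four per-attempt components named above. That the ESPN figures used in this paper were computed in exactly this way is an assumption; the function name `passer_rating` is chosen for the example.

```python
def passer_rating(attempts, completions, yards, touchdowns, interceptions):
    """Standard NFL passer rating on its 0 to 158.3 scale."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    def clamp(value):
        # Each component is limited to the range [0, 2.375].
        return max(0.0, min(value, 2.375))

    a = clamp((completions / attempts - 0.3) * 5)        # completion percentage
    b = clamp((yards / attempts - 3) * 0.25)             # yards per attempt
    c = clamp((touchdowns / attempts) * 20)              # touchdowns per attempt
    d = clamp(2.375 - (interceptions / attempts) * 25)   # interceptions per attempt
    return (a + b + c + d) / 6 * 100
```

Because every component is capped, a single outstanding category cannot carry the rating, which is one reason the interception term is examined separately in the next section.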

 

Interceptions: Offense

Interceptions are also an indication of success in the passing game. A low value for interceptions per attempt indicates that the quarterback is accurate and precise in executing passing plays. Interceptions per attempt averaged 0.025 during the regular season over the past 10 years and decreased to 0.014 during the playoff season. Again, a t-test was performed, which yielded a p-value of 0.047. This result is significant at the 0.05 level: interceptions per attempt differ between the regular and playoff seasons and are on average lower in the playoff season. This was expected, as teams cannot make mistakes such as interceptions and still win playoff games. Reducing interceptions between the regular season and playoff season is a strong characteristic of Super Bowl winners.

 

Third Down Conversion Percentage

Third down conversions give offenses more time on the field and demonstrate an offense’s ability to keep moving down the field toward the opponent’s end zone. The t-test on third down conversion percentage produced a p-value of 0.112. Again, the group read this as evidence against the hypothesis that average third down conversion percentages are the same in the regular and playoff seasons, although the result does not reach the conventional 0.05 significance level. Conversion percentages average 41.14 in the regular season and 44.87 in the playoff season. Based on historical data, conversion percentages increase in the playoff season, which is logical in that teams must sustain drives once less capable opponents have been eliminated.

 

These t-tests produced logical p-values. The statistical tests essentially suggest that quarterbacks of eventual championship winning teams play better in the playoffs than in the regular season, which is to be expected.  It also makes sense that interceptions per attempt decrease at a more significant level than quarterback rating increases, because winning teams do not turn the ball over.
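The paper does not state which form of t-test Minitab was asked to run. As one possibility, the sketch below computes the test statistic and degrees of freedom for Welch's two-sample t-test using only the Python standard library. The choice of Welch's test, the function name `welch_t`, and the sample values in the usage note are assumptions for illustration; a paired test on the same eleven teams would be an equally plausible reading.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(regular, playoff):
    """Return (t, df) for Welch's two-sample t-test.

    t is computed for mean(playoff) - mean(regular), so a positive
    value means the statistic was higher in the playoffs.
    """
    n_r, n_p = len(regular), len(playoff)
    if n_r < 2 or n_p < 2:
        raise ValueError("each sample needs at least two observations")

    # Squared standard error contributed by each sample.
    se2_r = variance(regular) / n_r
    se2_p = variance(playoff) / n_p
    if se2_r + se2_p == 0:
        raise ValueError("both samples have zero variance")

    t = (mean(playoff) - mean(regular)) / sqrt(se2_r + se2_p)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_r + se2_p) ** 2 / (se2_r ** 2 / (n_r - 1) + se2_p ** 2 / (n_p - 1))
    return t, df
```

With made-up inputs such as `welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])`, the function returns the statistic and degrees of freedom that a t-table or statistics package would then convert into a p-value like those reported above.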

 

Tackles & Defensive Interceptions

T-tests were also performed on defensive statistics. Teams averaged 62.92 tackles per game in the regular season and 63.28 in the playoffs. The p-value for the test that average tackles are the same in the regular and playoff seasons is 0.015, which suggests that teams do not make the same number of tackles in the two seasons. The project group found this result surprising, as the regular season and playoff averages both round to 63 tackles per game. Defenses averaged 1.295 interceptions per game in the regular season and 1.659 in the playoffs. The p-value for this test is 0.128, which is not significant at the conventional 0.05 level, but the averages suggest that defensive interceptions per game have historically increased during the playoff season.

 

Elite Quarterbacks

Super Bowl winning teams have to be able to score, which requires a strong offense, and the quarterback captains the offense. Quarterbacks are called “elite” when they are likely to be inducted into the National Football League’s Hall of Fame. Since 2002, nine of the 11 winning teams had quarterbacks who are now considered elite. However, a definition or set of clear qualities that identified an elite quarterback before he wins a championship would indicate that his team is likely to appear in and win the Super Bowl. The group initially hoped to provide such a definition based on offensive statistics, but this question remains unanswered. Eventual Super Bowl winning teams all have quarterbacks with a high QBR and high passing yards per attempt. These were the only qualities consistent across all championship teams analyzed; however, these alone do not classify a quarterback as elite, as many non-elite quarterbacks meet these criteria. In conclusion, Super Bowl winning teams had elite offensive leaders 82% of the time over the last ten years. Elite quarterbacks appear in the Super Bowl the majority of the time, yet there is no clear definition that indicates whether a current quarterback will be considered elite in the future.

 

Championship Characteristics

Regular season performance determines whether or not a team is eligible for the playoffs, which in turn determine the teams facing off for the Lombardi Trophy. The project group sought to find characteristics of championship teams that were prominent during the regular season. Using scatter plots in Orange Canvas, the group looked at each statistic for these teams from 2002 to the present. Of all the data, four statistics were consistent across all championship teams, with these teams predominantly clustered at an above average level: quarterback rating, passing yards per attempt, rushing attempts per game, and sacks per game. This is logical. A high quarterback rating indicates a high probability of winning games as well as a high number of expected points scored, and it reflects the qualified quarterback leading the offense that was characteristic of winning teams. High passing yards per attempt and rushing attempts per game indicate that champions run the ball at a high rate and then capitalize on their passing attempts. Sacks per game illustrate the defensive strength that prevents opponents from outscoring Super Bowl winning teams; eventual winners are able to penetrate the opposing offensive line and apply pressure on the opposing team’s quarterback and offense in general.

 

Classification: “Hot”

Sports analysts continually discuss whether a team is “hot” before entering the playoffs and whether or not this is a deciding factor in the Super Bowl. The group set out to determine whether data can be used to identify which teams are “hot” and whether those teams have a chance at winning the Super Bowl.  The group looked at how many eventual champions won their last game of the season, their record over the last five games of the regular season, and their record over the last eight games of the regular season. Surprisingly, there was no consistency in any of the three. The group had initially predicted that eventual champions would consistently win games, cementing their team strengths before playing in the Super Bowl. The Ravens, Super Bowl winners in 2012, went 1-4 in their last five games, including a loss in the final game of the year. Only one team in the last seven seasons won at least four of its last five games: the Steelers, who went 4-1. None of the last seven Super Bowl champions won all of their final five regular season games. Although the Ravens lost their last game of the regular season, many eventual champions do in fact win their last game.  The Giants (2011) and the Packers (2010) had to win their last two games of the season in order to be eligible for the playoffs. After looking at this data, the project group determined that what matters is how a team is playing upon entering the playoffs, not whether or not it won its last five games. A great example was the Giants in the 2007 season. The Giants lost their last game of the season.  However, they lost to the New England Patriots, who, by winning that game, finished the regular season undefeated.  The Giants played well, losing by only three points, which made them the team that came closest to defeating the Patriots.
The Giants rode this momentum through the playoffs and eventually to the Super Bowl, where they were able to defeat the Patriots when it mattered most.  Although the project group was able to conclude that a team’s performance matters more than a team win, the data did not yield a definition or classification tool to determine whether or not a team was “hot.”
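The three win/loss measures the group examined are simple to compute once a season's results are listed in order. The Python sketch below shows one way to do so; the function names and the result string in the usage note are hypothetical and do not reproduce any team's actual schedule.

```python
def last_n_record(results, n):
    """Return (wins, losses) over the final n games.

    results: sequence of "W" / "L" in the order the games were played.
    """
    if n < 1 or n > len(results):
        raise ValueError("n must be between 1 and the number of games")
    tail = list(results)[-n:]
    return tail.count("W"), tail.count("L")


def won_last_game(results):
    """True if the final regular season game was a win."""
    return last_n_record(results, 1) == (1, 0)
```

For a made-up season such as `"WWWLWLLLWL"`, `last_n_record(season, 5)` and `won_last_game(season)` give the same kind of summary the group compared across champions. As the discussion above concludes, these counts alone did not separate “hot” teams from the rest.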

 

Offensive Strategy

Eventual Super Bowl winners are able to effectively outscore their opponents, but this raises the question of how champion offenses accomplish this: through the passing game, the running game, or a balance of the two. Historical data shows that champions’ passing attempts per game range between 30 and 35 for 17 of the 22 data points, and rushing attempts per game have a wider range, between 25 and 33, for 18 of the 22 data points. The New England Patriots did not follow this trend when their passing attempts per game were evaluated individually for the two playoff seasons before their Super Bowl appearance, averaging 27 and 42 passing attempts per game. However, when averaged, the Patriots passed the ball 34.5 times each game, which falls within the aforementioned range. The Giants were also outliers, averaging 38.31 and 40.75 passing attempts per game in the regular and playoff seasons. The Indianapolis Colts averaged 38.5 passing attempts per game in the playoff season. The Baltimore Ravens, Colts, and Steelers fall outside the range for rushing attempts per game. The Ravens and Colts rushed the ball more during the playoff season, while the Steelers consistently rushed the ball more throughout the entire 2005 season. This can be attributed to running back Jerome Bettis, who retired after 2005, and explains why the Steelers were not outliers during their 2008 appearance. The ranges for passing and rushing attempts overlap. Because of this, the group concluded that, overall, balanced offenses with the ability to effectively throw and run the ball are characteristic of Super Bowl winning teams. This observation was predicted, as teams that continually vary their offensive strategy challenge opposing defenses, scoring points in ways that are varied and difficult to predict.

 

Team Strength: Offense vs. Defense

After analyzing champion offenses, it was necessary to determine whether winning teams relied on offense more than defense or balanced both aspects of the team. In order to do this, the group compared average points per game, 24.76, and average opponent points per game, 20.07. The overall averages do not indicate a large difference between winning and losing teams; however, when the score margin is averaged, champions score on average 6.5 points more than their opponents, roughly a touchdown more than their defense allows. The group concluded that Super Bowl winning teams have a balanced strategy based on the low scoring margin. Champions collectively have above average offenses and defenses without having to particularly rely on either.

 

Running Back-by-Committee

The running back-by-committee strategy is becoming increasingly popular at the collegiate and professional level, with four of the last six champions utilizing it.  “Running back-by-committee” is defined as having at least two legitimate rushing threats who can line up in the backfield.  There are two advantages to this strategy. The first is that extra depth at the running back position shows as the defensive line and linebackers grow tired during the course of the game while the running backs are able to continue to play as if it were the first quarter. The second, and more important, is that the opposing team has to split its time in the film room, meaning it spends less time studying each running back individually. Oddly enough, a championship team that used only one running back during the regular season supports this reasoning. In 2010 the Packers ran the ball with Brandon Jackson during the regular season. The Packers had selected James Starks in that year’s draft; however, he was injured in training camp and did not play until the conclusion of the regular season.  Then, in the first game of the playoffs, Starks set the franchise record for most yards in a postseason game by a rookie and continued to play at a high level for the duration of the playoffs. James Starks clearly has talent; however, he performed at a high level in part because teams did not have film on him, making it difficult to prepare for and thwart his running skills. The group concluded that utilizing a running back-by-committee strategy can be an advantage if a team has qualified personnel; however, it is by no means necessary to win the Super Bowl.

 

Perceived Conference Difficulty

The project group predicted that the perceived difficulty of the conference would correlate with the Super Bowl winning teams. To test this, the group collected data over the 11 seasons on how many wins each conference had when facing opponents from the other conference. The group predicted that the winning conference, meaning the one that won more inter-conference games, would produce the eventual Super Bowl winner, but in this case the prediction was wrong. Since 2002, the Super Bowl winning team was from the conference that won more games only 54% of the time. The probability that the AFC wins the Super Bowl and is the winning conference is around 40%. In contrast, the probability that the NFC wins the Super Bowl and is the losing conference is around 37%. Based on historical data, the conference difficulty only indicates the championship team roughly half of the time. The NFC is more likely than the AFC to produce a Super Bowl winning team as the losing conference; the AFC has done so about 14% of the time. The probability that the NFC is the winning conference and also wins the Super Bowl is approximately 8%. With the 11 data points from 2002 to 2012, inter-conference record is not a consistent predictor of the Super Bowl winner.
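The four conference outcomes discussed above form a two-by-two tally: which conference won the Super Bowl, and whether that conference also won the inter-conference series. The Python sketch below shows one way to compute the shares from per-season records. The function name, the input format, and the sample records in the usage note are assumptions; the group's spreadsheet calculations are not shown in the paper, so this is not a reproduction of how the percentages above were derived.

```python
from collections import Counter


def conference_outcomes(seasons):
    """Share of seasons falling into each (champion, role) combination.

    seasons: iterable of (champion_conference, stronger_conference) pairs,
        where stronger_conference is the one with more inter-conference wins.
        A tie could be passed as None and is then counted as
        "losing conference" for the champion (an assumption of this sketch).
    """
    counts = Counter()
    for champion, stronger in seasons:
        role = "winning conference" if champion == stronger else "losing conference"
        counts[(champion, role)] += 1

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no seasons supplied")
    return {key: count / total for key, count in counts.items()}
```

Calling it on placeholder records such as `[("AFC", "AFC"), ("NFC", "AFC")]` returns the fraction of seasons in each cell, and summing the two "winning conference" cells gives the share of champions that came from the stronger conference.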

 

Emotional and Circumstantial Influence

Though data is important, the project group also sought to consider the emotional aspects affecting a team and how emotion plays into the game of football.  Football is a strategic and calculated sport; however, players, coaches and fans are all emotionally invested. When players have additional incentives to play for, they are believed to play at their best for the duration of the game rather than noticing their body’s natural fatigue. Although emotion cannot be measured or quantified, the group still decided to explore this aspect of the game. Emotion has been evident in the last ten years, most recently this year with Ray Lewis and the Ravens.  Ray Lewis was injured in week six and sat out the rest of the regular season (11 weeks).  He not only returned for the playoffs but also announced that he would be retiring at the end of the season.  The fact that this would be his last attempt at a title, combined with his pastoral pre-game speeches, had the Ravens playing hard in every game throughout the playoffs.  Although Ray Lewis did not play at an all-star level during the championship run, his drive for the win and inspiration for his team helped the Ravens play at the level necessary to end the season as champions.

 

Another example of an emotional advantage is the Saints’ run to the Super Bowl in 2009 after the emotional roller coaster that came in the wake of Hurricane Katrina. Katrina devastated the city of New Orleans, killing over 1500 people in the state of Louisiana and forcing 1.2 million people along the Gulf Coast to evacuate. Almost six months after Hurricane Katrina, the Saints opted to sign a quarterback who was undesired in the league because he was, at the time, returning from shoulder surgery. That quarterback, Drew Brees, was thankful for a second chance. He moved into the city of New Orleans at a time when other players did not want to live in a city that required rebuilding. He invested himself in the city and helped with relief efforts. By the 2009 season, Drew Brees had put together several successful seasons. New Orleans was beginning to see an influx of residents and visitors once more, and life in the city was slowly moving back to its pre-Katrina status quo. The difference this time, however, was that the citizens of New Orleans had a phenomenal football team to root for. After a first round bye, on January 16, 2010, the Superdome hosted a playoff game; four years earlier it had been used as a shelter for the citizens of New Orleans. The Saints went on to win the Super Bowl after two victories in the Superdome and one over the Colts in Miami.  After the game was over, Drew Brees was quoted as saying “We felt as if there was no way we could lose this game...This one is for you New Orleans.” The project group believes times like these can cause players to play at a level they ordinarily cannot reach.


 

Conclusions


After utilizing ESPN’s rather large database of NFL statistics, the group was able to answer its research questions and draw conclusions.  The group determined that three things do not factor into a team winning the Super Bowl: conference difficulty, whether or not the team had a bye, and whether or not it used a running back-by-committee system.  The group determined that if a team has the proper personnel, then using multiple running backs can prove advantageous; however, it is by no means a requirement to win the Super Bowl.

 

The group determined that how a team’s quarterback plays in the playoffs is indicative of how far that team goes.  Quarterbacks for eventual champions play better in the playoffs, with a higher QBR and a lower interception rate.  The rest of the team also plays better in the playoffs, historically having higher third down conversion rates and defensive interception rates.
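As a rough illustration of the rate statistics mentioned above, the sketch below computes an interception rate (interceptions per pass attempt) and a third down conversion rate from raw counts. The function names and the sample counts are hypothetical; they are not taken from the ESPN data used in the project.

```python
def interception_rate(interceptions: int, pass_attempts: int) -> float:
    """Fraction of pass attempts that ended in an interception."""
    if pass_attempts <= 0:
        raise ValueError("pass_attempts must be positive")
    return interceptions / pass_attempts


def third_down_conversion_rate(conversions: int, attempts: int) -> float:
    """Fraction of third down attempts that were converted."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return conversions / attempts


# Made-up counts for one quarterback, regular season vs. playoffs.
regular_season = interception_rate(interceptions=12, pass_attempts=500)
playoffs = interception_rate(interceptions=2, pass_attempts=120)

print(f"Regular season interception rate: {regular_season:.3f}")
print(f"Playoff interception rate: {playoffs:.3f}")
print(f"Third down conversion rate: {third_down_conversion_rate(5, 10):.2f}")
```

Comparing the regular season and playoff values for the same team is the kind of check the conclusion above describes.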

 

There also proved to be consistencies with champions during the regular season.  Teams tend to have balanced offensive attacks (rushing/passing) along with above-average offenses and defenses, but not to the point where teams have to rely on either one.  There proved to be more consistencies for Super Bowl teams with specific statistical categories.  The conclusions the group was able to draw are as follows: teams establish a solid running game and then capitalize on their passing opportunities.  Teams are also able to put pressure on quarterbacks and disrupt the running game in a similar manner.

 

The group found that there are also circumstances that affect the Super Bowl outcome but cannot be quantified.  These include how “hot” a team is and whether or not teams have an emotional advantage.  It was determined that how a team is playing at the end of the season can cause it to be “hot”; however, a win/loss record does not depict it.  Teams can also have an emotional edge if they have added incentives to play for, like the Saints in the wake of Hurricane Katrina. Though there are common factors among championship teams, there is not a definitive answer as to whether or not a team will win the Super Bowl.


 

Sources



 


 











           

 

Sunday, April 28, 2013

Tutorial: how to load the 1000 genomes data into Amazon Web Services

This tutorial gives written instructions for each step, followed by a picture for that step.

Step 1.
Start by logging into AWS. Once you have done that, you will see this page. Click "EC2" (virtual servers in the cloud).
Step 2.
Click on "Launch Instance"

Step 3. The next page will say "launch with classic wizard." Just click "Continue."

Step 4. The next page will be titled "Request instances wizard." Just click the "Community AMIs" tab.

Step 5. Next to the "Viewing all images" drop-down field, type in "1000HumanGenomes."

Step 6. Once the AMIs have popped up, click "Select" next to the first one.



Step 7. After that you will be taken to the instance types selection. Click the drop down arrow and select the type of instance you would like to use. I chose "M1 Large."

Step 8. Next you will be prompted to create a password in order to access your AMI. Type your password in the text field shown in the picture.




Step 9. Next you will be prompted with a "Storage device configuration" menu. Just click continue.

Step 10. It will ask you if you want to tag your instance. You can just click continue.



Step 11.

Next you will be prompted to enter your personal key pair. Enter your key pair into the text field marked in the photo.


Step 12.  Next, you will be prompted to enter your security group. Just select the default one.


Step 13. You will be shown all the specifics you requested in the previous steps. Click "Launch" if they are all satisfactory.


Step 14. After that, you will be told that your instance is being launched. Click "close."



Step 15. In your instances section, check the Status Checks column. After a while, it should say "checks passed."




Step 16.  After that, you are done. If you have a piece of software called Linux bio cloud on a computer with an Ubuntu Linux operating system, you should be able to work with the data!
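For readers who would rather script the launch than click through the wizard, here is a minimal sketch using the boto3 library. It is an alternative to the console steps, not part of the original tutorial. The AMI ID, key pair name, and region are placeholders you would need to replace with your own values, the helper names are hypothetical, and the M1 Large type ("m1.large" in the API) may not be offered in every region.

```python
def build_launch_request(ami_id, instance_type="m1.large",
                         key_name=None, security_group="default"):
    """Assemble the keyword arguments for EC2's run_instances call.

    Mirrors the wizard: AMI (Step 6), instance type (Step 7),
    key pair (Step 11), and security group (Step 12).
    """
    request = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "SecurityGroups": [security_group],
    }
    if key_name:
        request["KeyName"] = key_name
    return request


def launch(request, region="us-east-1"):
    """Launch the instance and wait for its status checks (Step 15).

    Requires the third-party boto3 package and configured AWS credentials.
    The region default is an assumption; use the region that hosts the AMI.
    """
    import boto3  # imported here so the request can be built without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    instance_id = response["Instances"][0]["InstanceId"]
    # Block until the instance's status checks report OK.
    ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
    return instance_id


# "ami-xxxxxxxx" and "my-key-pair" are placeholders, not real identifiers.
request = build_launch_request("ami-xxxxxxxx", key_name="my-key-pair")
print(request)
```

Calling launch(request) would start a billable instance, so the example above only builds and prints the request.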










Tuesday, April 23, 2013

OutWit Hub: Web-scraping made easy

I read a blog post earlier this term on web scraping and decided to check it out. I started with the suggested software and quickly realized that only a few really good web-scraping tools are supported on Mac OS. So, after reading a few reviews, I landed on OutWit Hub.

OutWit Hub has 2 versions: Basic and Pro. The difference is in the available tools. In Basic, the "words" tool isn't available; this tool lets you see the frequency of any word as it occurs on the page you are currently viewing. Several of the scraping tools are unavailable as well. I've upgraded to Pro; it's only $60 per year, and I was curious to see what else it can do.

I'm not a computer scientist, by a long shot, but I have a general grasp on coding and how computers operate. For this reason, I really like OutWit Hub. The tutorials on this site are incredible. They walk you through examples and you can interact with the UI while the tutorial is going. Also, a lot of the tools are pretty intuitive to use. If you're not sold on getting the Pro version, I'd encourage you to visit their website and download the free version just to check out the tutorials. They're really great.

I've tried it on several examples just to test. I needed to get all of the email addresses off of an organization's website, so instead of copy/pasting everything and praying for the best, I used the "email" feature on OutWit, and the names and emails of every member on the page populated an exportable table. #boom
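If you're curious what that kind of extraction looks like in code, here is a minimal Python sketch that pulls email addresses out of page text with a regular expression. This is only an illustration of the idea, not how OutWit Hub works internally; the pattern is deliberately simple and the sample text is made up.

```python
import re

# A simple pattern that covers common addresses; it is not a full
# implementation of the email address standard.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(page_text):
    """Return the unique email addresses in page_text, in order of appearance."""
    unique = []
    for address in EMAIL_PATTERN.findall(page_text):
        if address not in unique:
            unique.append(address)
    return unique


# Made-up sample text standing in for a members page.
sample = "Contact jane@example.org or bob.smith@example.com. Again: jane@example.org"
print(extract_emails(sample))
```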

Then, I wanted to see if it could be harnessed for Twitter and Facebook. So, using the source-code approach to scraping, I was able to extract text from the loaded parts of my Twitter and Facebook feeds. The problems I encountered were: not knowing enough about the coding to make the scraper dynamic enough to reach the parts of the page that hadn't loaded yet, and not knowing how to automate it to build a larger dataset (i.e., continuously run the scraper over a set amount of time by reloading the page and harvesting the data). It's possible; I just didn't figure it out.

So, I've recorded a tutorial on how to use OutWit Hub Pro's scraper feature to scrape the loaded part of your Facebook news feed. Below are the written instructions, and the video at the bottom gives you the visual.

Essentially, you will:
1.) Launch OutWit Hub (presuming you've downloaded and upgraded to Pro).
2.) Login to your profile on Facebook.
3.) Take note of whatever text you want to capture as a reference point when you go to look in the code. This is assuming you don't know how to read HTML. For example, if the first person on your news feed says: "Hey check out this video!", then take note of their statement "Hey check out this video!"
4.) Click the "scrapers" item on the left side of the screen.
5.) In the search window, type in the text "Hey check out this video" and observe the indicators in the code that mark the beginning and end of that text.
6.) In the window below the code, click the "New" button.
7.) Type in a name for the scraper.
8.) Click the checkbox in row 1 of the window.
9.) Enter a title/description for the information you're collecting in the first column. Using the same example: "Stuff friends say on FB" or "Text". It really only matters if you're going to be extracting other data from the same page and want to keep it separate.
10.) Type in the HTML code that you identified as the beginning of the data that you want to extract under the "Marker Before" column.
11.) Repeat step 10 for the next column using the HTML code that you identified as the end of the data.
12.) Click "Execute".
13.) Your data is now available for export in several formats: CSV, Excel, SQL, HTML, TXT.
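
The marker idea in the steps above (grab whatever sits between a "before" string and an "after" string) can be sketched in a few lines of Python. This is just an illustration of the technique, not OutWit Hub's actual implementation; the function names and the sample HTML, including the "msg" class, are made up and are not Facebook's real markup.

```python
import csv
import io
import re


def extract_between(source, marker_before, marker_after):
    """Return every piece of text found between the two markers."""
    pattern = re.escape(marker_before) + r"(.*?)" + re.escape(marker_after)
    return [match.strip() for match in re.findall(pattern, source, flags=re.DOTALL)]


def to_csv(column_name, values):
    """Write the extracted values as a one-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column_name])
    for value in values:
        writer.writerow([value])
    return buffer.getvalue()


# Made-up page source standing in for the loaded part of a news feed.
html = ('<span class="msg">Hey check out this video!</span>'
        '<span class="msg">Go Tigers</span>')

posts = extract_between(html, '<span class="msg">', '</span>')
print(posts)
print(to_csv("Text", posts))
```

The non-greedy `(.*?)` matters here: it stops at the first "after" marker, so each post is captured separately instead of everything between the first and last markers on the page.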

Here is a YouTube video example of me using it to extract and display comments made by my Facebook friends that appeared on my news feed.